TL;DR
- In Anthropic’s study, developers using AI were only slightly faster at solving unfamiliar tasks, but scored significantly lower on a follow-up comprehension quiz than those without AI.
- But the average result hides an important difference: some ways of using AI harmed understanding, while others supported it.
- The decisive factor was not AI use itself, but how participants used it: delegating coding or debugging to AI was linked to the weakest understanding.
- The strongest results came from generating code first and then actively questioning and verifying how it worked.
- These findings apply to immediate comprehension in a tightly time-constrained study, not to long-term skill development or everyday development work.
This post is part of a series.
- Part 1: Speed vs. Skill
- Part 2: AI and Elaboration: Which Coding Patterns Build Understanding?
- Part 3: Understanding AI Coding Patterns Through Cognitive Load Theory (this post)
This is the third post in “Developing with AI Through the Cognitive Lens,” a series exploring how AI coding tools affect the way programmers learn, work, and build expertise. Drawing on cognitive psychology research—particularly Felienne Hermans' work in The Programmer’s Brain—this series examines what happens to our skills when we delegate cognitive work to AI. The goal isn’t to reject AI, but to use it deliberately, making conscious choices about when it helps and when it hinders.
A study recently published by Anthropic, the company behind Claude Code, examined how AI coding assistants affect developers learning unfamiliar technology. The overall finding seems damning: participants using AI were only marginally faster, and this difference was not even statistically significant. At the same time, they scored significantly worse on comprehension—50% versus 67% for those without AI assistance.
But the study results are much more nuanced than that. Within the AI group, researchers identified six distinct interaction patterns with comprehension scores ranging from 24% to 86%. Some patterns severely harmed learning, while others enhanced it beyond what manual coding without any AI assistance achieved. The difference wasn’t the tool, but how participants engaged with it cognitively.
In this post, I’ll examine these patterns through the lens of cognitive psychology, particularly cognitive load theory, to explain why they produced such different outcomes. I’ll also discuss the study’s limitations: what these findings can tell us about AI coding patterns, and where we must be cautious about generalising beyond this specific context.
Before diving into interpretation, let’s review the study design and results.
The study design
In the study, 52 engineers, most of them juniors, had to solve programming tasks in Python using the Trio library. All of the participants had prior experience with both Python and AI-assisted coding, but were not familiar with Trio.
The participants were split into two groups: a test group with access to AI assistants, and a control group without. All participants first had to solve two programming tasks, each implementing a feature using Trio. They were told to solve the tasks as quickly as possible, under a hard time limit of 35 minutes.
The programming tasks were directly followed by a quiz that tested their understanding of the concepts they had needed to solve the programming tasks.
The results
For the detailed results, please refer to the study. Here, I want to highlight the most relevant and striking results:
- Participants in the test group were two minutes faster than those in the control group, on average. However, this result is not statistically significant.
- On average, participants in the test group scored 17% lower than the control group in the quiz that followed the programming tasks, which corresponds to about two letter grades. Unlike the speed increase, this result is statistically significant.
- In the test group, six distinct AI interaction patterns emerged with very different outcomes for both productivity and learning.
Interaction patterns that hinder learning
Of the six interaction patterns the study authors identified, three led to low scores in the quiz, showing a poor understanding of the technology they had just used. On average, participants using one of these failing patterns reached scores of less than 40%, whereas the average score among all participants in the test group was 50%.
Let’s look at these interaction patterns in more detail.
AI delegation and progressive AI delegation
Four participants relied entirely on AI to write code. They described what they needed and let the AI generate the solution. This group completed tasks fastest and encountered few errors. Four other participants started by asking the AI assistant one or two questions, but then delegated all code generation to the AI. In the quiz, the AI delegation group scored 39% on average and the progressive AI delegation group 35%.
Why did these participants score so poorly in the quiz? Cognitive psychology research on how learning works offers an explanation: both patterns bypass elaboration entirely—the process of connecting new information to your existing mental models.
When AI generates code without your engagement, critical learning processes never occur. You don’t compare alternatives, you don’t adapt the code, and you don’t encounter errors.
The manual coding group hit many errors that, according to the authors, “mapped directly to topics tested on the evaluation.” The AI delegation group avoided these errors, which felt efficient but eliminated the friction that helps with encoding concepts into memory.
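The study doesn't list the specific errors participants hit, but a classic example of this kind of instructive friction in async Python is calling a coroutine function without `await`—a mistake whose symptoms teach you what a coroutine object actually is:

```python
import asyncio

async def work() -> str:
    await asyncio.sleep(0)
    return "done"

async def main() -> str:
    result = work()       # bug: missing `await` -- this does NOT run work();
                          # it merely creates a coroutine object
    assert asyncio.iscoroutine(result)
    return await result   # awaiting the object is what actually runs it

value = asyncio.run(main())
print(value)
```

Hitting this error yourself, and working out why `result` is not the string you expected, encodes the distinction between defining, creating, and running a coroutine far more durably than having an assistant silently produce correct code.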
Iterative AI debugging
Four more participants used an interaction pattern the authors of the study call iterative AI debugging:
Participants in this group relied on AI to debug or verify their code. They asked more questions, but used the assistant to solve problems rather than to clarify their own understanding. With an average quiz score of 24%, they showed the weakest understanding of Trio of all six interaction patterns—and, interestingly, they were also notably slower at completing the two tasks.
When you ask AI to diagnose and solve errors, you outsource the cognitive work that builds not only debugging skills but also a mental model of how the technology you use works—in this case Trio. If you don't spend time thinking about what could cause an observed error, and don't systematically test your theories, nothing corrects or refines your possibly poor mental model of how that technology works.
Interaction patterns that help with learning
The good news is that not all ways of interacting with AI assistants are bad. The authors of the study identified three interaction patterns whose behaviours led to an average score of 65% or higher. Let’s look at how these behaviours differ from the previously described patterns and how we can explain their benefit for building understanding.
Generation-then-comprehension
The two participants in this group used AI for generating code, but after doing so, they did not move on to the next task. Instead, they continued by asking the AI assistant questions to verify their understanding of the solution. In contrast to the AI delegation group, they were slower, but achieved the highest comprehension scores in the entire study: 86% on average—significantly better than even the control group’s 67%.
How can we explain this gap in understanding? The key difference from AI delegation lies in the cognitive work that happens after generation. The AI reduces the friction of syntax and boilerplate, but the learner maintains full cognitive ownership of understanding. This combines the efficiency benefits of AI generation with the learning benefits of active interrogation. Generating hypotheses about how code works and testing them strengthens memory more than passive reading. The participants weren't just accepting AI output. Instead, they were actively building and testing their mental models against it.
What’s remarkable is that this interaction pattern scored 19 percentage points better than solving the tasks without AI. This shows that in some circumstances, conscious AI use can enhance learning, not just preserve it.
However, this pattern requires discipline. The temptation after generating working code is to move on. These participants resisted that temptation and invested time in verification. The slightly slower completion time bought significantly deeper understanding.
Hybrid code-explanation
Three participants used an interaction pattern that the authors call hybrid code-explanation. They asked the AI assistant to generate both code and the appropriate explanations. The participants took their time to read the explanations, so they were slower than the AI delegation group, but scored better in the quiz. With an average score of 68%, they were in the same range as the control group.
By requesting both code and explanations, participants got the solution and the reasoning behind it. The explanations made expert reasoning visible, showing not just what works, but why it works. This provided connection points to existing mental models and helped participants understand the concepts behind the implementation.
However, this pattern doesn’t involve true elaboration in the sense of actively connecting concepts to prior knowledge. Participants read explanations provided upfront rather than generating their own understanding through questioning and active processing. The explanations provide valuable context, but the learning remains relatively passive. This is more like reading a well-explained textbook than actively working to build understanding. This distinction helps explain why hybrid code-explanation, while effective, didn’t achieve the highest comprehension scores.
Conceptual inquiry
Seven participants used the AI assistant exclusively for asking conceptual questions. They then used their improved understanding of Trio to implement a solution for the respective task. They encountered many errors, but did not rely on the AI to resolve them, instead coming to a solution on their own. In the quiz, their average score was 65%, in the same range as the control group.
This pattern preserves elaboration through both questioning and manual implementation. By asking conceptual questions, participants connected new Trio concepts to their existing knowledge. By coding manually and debugging independently, they engaged in active generation and error-driven learning—all mechanisms that support encoding into long-term memory.
The score of 65%, similar to the control group’s 67%, suggests this pattern is effective for building understanding. Both groups engaged in the same fundamental learning processes: forming hypotheses, testing them through implementation, and refining mental models through error correction. The AI served primarily to clarify concepts more efficiently than searching documentation, but the cognitive work of elaboration remained with the learner.
Understanding the exceptional results
We’ve seen how elaboration explains the fundamental differences between these patterns. The three failing patterns bypassed elaboration entirely, while the three successful patterns preserved it through different mechanisms. This accounts for why some patterns lead to learning while others don’t.
But elaboration alone doesn’t explain why generation-then-comprehension achieved a quiz score of 86%, which is substantially higher than both conceptual inquiry (65%) and even the control group (67%).
All three preserved elaboration, yet the outcomes differed significantly. To understand why one elaboration-preserving pattern outperformed the others so dramatically, we need to examine cognitive load theory and how the study’s 35-minute time constraint shaped these results.
A cognitive load theory primer
Cognitive Load Theory, developed by educational psychologist John Sweller, explains how the limitations of working memory affect our ability to perform cognitive tasks. Our working memory can only hold a limited amount of information at once. When it becomes overloaded, both task performance and learning suffer.
Sweller identified three types of cognitive load:
Intrinsic load is the inherent complexity of the task itself. In the Anthropic study, intrinsic load came from understanding the problem requirements and the conceptual complexity of asynchronous concurrency. This load can’t be eliminated. It’s fundamental to the task and concepts involved.
Extraneous load is cognitive effort wasted on things that don't contribute to completing the task or understanding the concepts. Poor documentation, confusing error messages, a user interface full of distractions, unfamiliar syntax and concepts, and navigating unfamiliar tools all create extraneous load. Unlike intrinsic load, extraneous load should be minimised, since it consumes working memory capacity without advancing either task completion or learning. Whether extraneous load is low or high for the same task depends on which mental models and schemata you have already built through experience.
Germane load is the productive cognitive effort invested in building understanding—constructing mental models and forming durable knowledge structures. This includes elaboration, schema construction, and integrating new patterns into long-term memory. Germane load is what enables lasting learning that transfers beyond the immediate task.
The three types of load compete for the same limited resource. If intrinsic load is high (complex task, unfamiliar concepts) and extraneous load is also high (poor tools, unclear documentation), little capacity remains for germane load, the mental work that builds the understanding tested in the quiz. If extraneous load is low, more working memory capacity is available for germane load, helping with integrating new concepts into long-term memory.
Explaining the results through Cognitive Load Theory
Cognitive load theory helps explain why the 35-minute time constraint in the study mattered so much. How participants allocated their limited cognitive resources determined both their task completion speed and their learning outcomes.
The key insight: patterns that bypassed the intrinsic load of the implementation task and minimised extraneous load freed up more capacity for germane load. In a time-constrained context, this allocation of cognitive resources proved decisive. Generation-then-comprehension participants could spend nearly all 35 minutes on learning, while others had to split time between solving the problem, implementing it, and learning concepts.
Generation-then-comprehension’s exceptional comprehension score becomes clear through this lens. By having AI generate the solution, participants completely bypassed the intrinsic load of figuring out how to implement the features. They also avoided the extraneous load of wrestling with unfamiliar Trio syntax and patterns. This meant they could dedicate virtually all 35 minutes to germane load—actively questioning the code, building mental models, and connecting Trio concepts to their existing knowledge. The time allocation was optimal for learning: zero minutes on implementation, maximum minutes on understanding.
Conceptual inquiry and the control group (scoring at 65% and 67% respectively) performed similarly because both faced the same cognitive load demands. Participants had to manage the full intrinsic load of solving the implementation problem themselves, handle the extraneous load of unfamiliar Trio syntax (conceptual inquiry) or documentation searching (control), and still try to learn the underlying concepts, for instance through dealing with errors. Their 35 minutes was split across multiple competing demands: designing the solution, implementing it, debugging errors, and building understanding. Less time for germane load meant less learning, reflected in their lower scores.
Hybrid code-explanation (scoring at 68%) sits between these extremes. Like generation-then-comprehension, it bypassed task intrinsic load through AI-generated solutions and minimised extraneous load. Participants could spend their time studying rather than implementing. However, as mentioned previously, the reading of provided explanations engaged working memory less intensively than the active questioning in generation-then-comprehension. The difference between 68% and 86% reflects the difference between passively reading explanations and actively constructing understanding through self-directed inquiry. This is completely consistent with what we know from the science of learning.
The 35-minute constraint amplified these differences. In a time-unlimited setting, the costs of managing intrinsic and extraneous load might matter less. But with a hard time limit, how cognitive resources were allocated became the determining factor in learning outcomes.
What we can and cannot conclude
The Anthropic study provides valuable data about how AI interaction patterns affect immediate comprehension in time-constrained learning tasks. However, the study's specific context limits how broadly we can apply these findings. Skill development involves more than quiz performance immediately after the tasks. It requires procedural fluency, debugging capability, long-term retention, and the ability to transfer knowledge to new problems. Moreover, durable encoding into long-term memory requires time and repetition. A single 35-minute session with immediate testing doesn't capture whether learning persists. Let's examine what the study does and doesn't tell us.
The study’s specific context
The following constraints shaped which interaction patterns succeeded:
- Time constraint: Hard 35-minute limit for completing tasks
- Task type: Implementing features with an unfamiliar library (Trio)
- Prior knowledge: Participants experienced with Python (1+ year weekly use), unfamiliar with Trio
- Population: Mostly junior engineers with AI coding experience
- Measurement: Immediate comprehension test following task completion
As we’ve seen through cognitive load theory, the 35-minute time limit made generation-then-comprehension optimal because it allowed participants to spend virtually all their time on germane load. But this finding may not generalise beyond these specific conditions.
What we can confidently conclude from the study
The study provides strong evidence for several important findings:
- How you use AI matters fundamentally. The 24% to 86% range in comprehension scores shows that interaction patterns, not just AI presence or absence, determine learning outcomes. Some patterns severely harm understanding while others enhance it.
- Elaboration distinguishes effective from ineffective patterns. Patterns that bypassed elaboration (AI delegation, progressive AI delegation, iterative AI debugging) led to poor comprehension. Patterns that preserved elaboration through questioning or manual implementation maintained or improved learning.
- In time-constrained contexts with unfamiliar libraries, generation-then-comprehension optimises learning. When developers need to quickly understand new frameworks or libraries, generating code then actively questioning it produces superior conceptual understanding compared to other approaches.
- Passive consumption versus active construction matters. Even among patterns that bypassed task intrinsic load, active questioning (86%) outperformed passive reading of explanations (68%), consistent with established learning science about the generation effect.
What we cannot conclude from the study
The study’s limited context means we must be cautious about broader generalisations:
- We cannot conclude that generation-then-comprehension is optimal for junior developers learning foundational programming concepts. The study tested developers who already had Python foundations learning an unfamiliar library. For developers still building foundational understanding, the cognitive load landscape differs. When basic syntax, control flow, and fundamental concepts still require conscious effort, the extraneous load doesn't come only from unfamiliar library patterns but from the programming fundamentals themselves.
- We cannot conclude anything about long-term skill development. The study measured immediate comprehension through a quiz. It didn’t test whether participants could implement similar solutions independently days or weeks later, debug their own code, or transfer understanding to new problems. These longer-term outcomes may favour different interaction patterns.
- We cannot conclude much about procedural skill development. The quiz mainly tested conceptual understanding. Manual implementation builds muscle memory, debugging patterns, and the procedural knowledge that comes from repeated practice.
- We cannot assume these results hold without time constraints. The 35-minute limit amplified the advantages of patterns that bypassed task intrinsic load. In realistic development contexts where developers have hours or days for tasks, the time allocation advantage of generation-then-comprehension diminishes.
Why realistic junior developer contexts may require different patterns
Understanding the study’s limitations through cognitive psychology reveals why junior developers might need different approaches:
Time constraints don’t reflect reality
Junior developers rarely work under hard 35-minute limits. When time isn’t scarce, the cognitive load optimisation that made generation-then-comprehension excel becomes less decisive. Spending time on implementation doesn’t compete with learning. It becomes a crucial part of learning.
Procedural fluency requires practice
The study tested conceptual understanding but not procedural skills: Can you implement similar solutions? Do you recognise when to apply these patterns? Can you debug independently? These abilities develop through repeated manual practice, not just through understanding explanations. The generation effect and deliberate practice research suggest that actively doing implementation work, even when it’s slower, may build more durable procedural knowledge than studying AI-generated solutions.
Error-driven learning builds debugging skills and mental models of failure
Generation-then-comprehension participants avoided errors, which helped them score well on conceptual questions. But encountering errors, forming hypotheses about causes, and debugging independently builds crucial mental models of how systems fail. These failure models are essential for validating AI-generated code and debugging production issues.
Long-term retention may differ from immediate comprehension
Learning science research on the generation effect and desirable difficulties suggests that more effortful learning—like implementing code yourself and debugging errors—often produces better long-term retention even when immediate performance is lower. The study’s quiz measured understanding immediately after the task. We don’t know how well participants retained this knowledge days or weeks later, or whether they could transfer it to different problems.
Foundational learning differs from learning new libraries
The study participants had solid Python foundations, so the extraneous load came primarily from unfamiliar Trio patterns. For junior developers learning foundational programming concepts, extraneous load is high across the board. The optimal interaction pattern for building foundations may differ from the optimal pattern for adding new libraries to an existing foundation.
The study provides valuable insights about AI interaction patterns in specific contexts. But realistic junior developer skill development involves longer timeframes, needs for procedural fluency, importance of error-driven learning, and requirements for long-term retention that the study didn’t measure. Different contexts may call for different patterns, even if those patterns don’t maximise immediate quiz performance under time pressure.
Conclusion
How you use AI matters more than whether you use it. The same tools produced comprehension scores ranging from 24% to 86%. Interaction patterns that bypassed elaboration failed, while those that preserved it succeeded. In the study's time-constrained context, generation-then-comprehension excelled by maximising time for active learning.
But context shapes outcomes. The study’s 35-minute limit and focus on immediate comprehension don’t reflect realistic skill development over months and years. Understanding the cognitive mechanisms, notably elaboration, cognitive load, active engagement, deliberate practice, and spaced repetition, helps you choose interaction patterns that support learning and understanding in your specific context, rather than blindly following findings from a single study.