This blog post is also available in German
The introduction of good AI tooling allows us as developers to feel more productive as we quickly produce more code and check off tasks. Unfortunately, increasing developer output has not been found to increase productivity at the company level (see this report) because the bottleneck shifts from the time spent writing the code to the time spent reviewing the code.
The question then becomes: how can we help make reviews easier to manage?
This isn’t a new problem. It has always been possible to create merge requests which are much too large for a human to easily understand and review, but because AI now makes it so much easier to generate so much more code, I think the problem is likely to become more serious in the future.
One of the most valuable skills we can learn as software developers is how to break a task down into chunks which can be easily reviewed and merged. This is also something we improve on over time: once we’ve done similar tasks a few times, we develop a feeling for how much code we can meaningfully manage and we learn to split the work into manageable chunks.
With LLMs, keeping the context small is also critically important because they begin to behave erratically when their context grows too big. And unfortunately, they don’t behave themselves. They may tell you that the large refactoring you are doing is really complicated and suggest that you take over and complete the remaining steps manually. Or they may change something which causes a test to fail and then become convinced that the failing test has nothing to do with their changes and doesn’t need to be fixed.
When working with an LLM, each prompt that you give it is added to the system context, which then determines the output that the model gives you. The LLM needs the correct context in order to provide you with a meaningful answer.
This is one of the benefits of the agentic programming paradigm: in addition to the prompts that you give it, the agent is able to retrieve relevant context from your actual code base to feed to the LLM.
The difficulties of managing system context
That detailed context is one of the reasons why agentic programming can be so effective and work so well.
The downside is that a lot of “noise” can be added to your context during a programming session. If a lot of big files have to be read in, those will land in your context. If there are a lot of error messages that the LLM has to read and interpret, those all land in the context. And if you get sidetracked and ask about something which isn’t relevant to your task at hand, those end up in the context too!
There are concrete limits to the size of a context that an LLM supports, but it is important to know that the performance of the model will decline as the size of the context increases (see report here). At a certain point, “context rot” sets in and the results will likely not be as expected (or desired).
With this in mind, we can keep our system context small by being as explicit as possible when referencing files or by pasting only the relevant error messages or snippets into the conversation. We can also ask an LLM to summarize the session so far and begin a new conversation with that summary to remove extraneous information. Another possibility is to delegate certain tasks to a sub-agent which works with its own system context (e.g. a sub-agent could be charged with searching through documents and returning only the ones relevant to the task at hand, so that the other files do not get added to the main system context).
In practice, I do not usually reach the limits of my system context. There have been a few examples where the task I assigned an agent really did push the system context to the limits and I began having poor results, but the vast majority of the time, any “context rot” that I have experienced has usually resulted from me forgetting to start a new session with a new context. Most of the time the system context is more than sufficient to implement the tasks that I set for it.
Repairing a vibe coding experiment gone wild
Far more often, the challenge runs in the other direction: how can I keep the amount of code generated by a coding agent small enough that I am able to comprehend it and evaluate it for correctness?
In one case, I experimented with vibe coding to generate a proof of concept for a complex user interface for my project. I iterated until I got to a version of the UI that I liked, but the result was that I had loads of generated code sitting around in my working directory: more code than I could mentally comprehend in a single sitting, and some of it felt not quite finished.
This is an example of my mental context window being completely overloaded by the scope of the work in question. I was sorely tempted to just push it all into one huge merge request and comment in a few places that I knew this wasn’t quite complete, but that I would iron out the details later.
But is that fair? To push that cognitive load onto my colleagues, who have to wade through everything and hope that they don’t overlook the needle in the haystack?
No. That’s not fair.
To fix the issue, I used a modified version of the git workflow that I have described in detail in another post to sort through the code and package it into small work items which could be easily reviewed. I asked my AI agent to give me an idea of how many distinct features were contained in my current directory. The agent thought three. (In the end, it turned out to be five.)
The crucial point here was to reduce the scope of what I had to work on at any given moment. I chose to focus on one specific refactoring that had been done alongside everything else. Then I used git add -p to interactively look at every change in the code base. Because my focus was on that specific feature, the question “does this belong to feature X?” became a simple yes/no decision instead of having to evaluate the correctness and quality of each and every change. Once I had added those changes, I used git stash -u to stash away everything else. Then I had a look at the staged changes with git diff --staged and did another code review of that piece of work. If tests were missing, I added them. If tweaks were needed to the code, tweak them I did. Only once that piece of code was complete (in my eyes) did I create a merge request for that change.
Once I had completed that task, I used git stash pop to retrieve all of the other changes that I had made together with my AI agent. It was still TOO MUCH! But at least it was less than before, and a little bit less is a little bit better.
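A minimal sketch of one round of that loop, assuming the UI refactoring is the feature being pulled out first (the commit message is illustrative, and I add --keep-index so the staged hunks survive the stash):

```sh
# Stage only the hunks that belong to the one feature in focus.
git add -p                          # per hunk: "does this belong to feature X?" yes/no

# Park everything else, including untracked files, but keep the staged feature in place.
git stash push -u --keep-index

# Review the isolated feature on its own, add missing tests, tweak where needed.
git diff --staged

# Commit it and open a small, self-contained merge request for just this piece.
git commit -m "Refactor UI layout components"   # illustrative message

# Bring the remaining tangle back and repeat with the next feature.
# (The stash also contains the hunks just committed, so the pop can occasionally
# report conflicts that simply resolve to what is already in the commit.)
git stash pop
```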
I continued this method until I was able to pull the different features apart, verify them, and then provide proper merge requests. This was an instance where the apparent productivity gain from the AI agent was completely swallowed up by the mental overhead I had to invest in order to split the work into comprehensible pieces.
I have learned from this experience and now keep a much closer eye on the amount of code that has been created so that I don’t end up making the same mistake.
Keep your scope small
The best mitigation for a large context is prevention! Keeping the scope of our tasks as small as possible will result in a coherent piece of work: both for our brains and for the LLM. When I hear super snazzy new buzzwordy terms like “context engineering”, it makes me cringe a little (as I’ve stated previously, I’m a bit allergic to buzzwords), but the skill of “context engineering” boils down to exactly this: keep the scope of a given feature as small as possible and keep irrelevant, unnecessary information out of the context so that the LLM can perform optimally.
What this means practically is that it is better to let an agent perform small, sequential steps that build on each other instead of writing a general prompt and hoping for the best. Using the plan mode available in agents like Claude Code and Codex can be helpful for this. The principles of test-driven development apply here as well: it is best to write the tests in advance and then ask the agent to write an implementation which fulfils the tests. Tests act as guard rails to ensure that the resulting implementation really fulfils the requirements that we need to meet.
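As a rough sketch of what that test-first loop can look like, assuming a Python project using pytest and a hypothetical test file (the prompt is paraphrased, not a specific agent command):

```sh
# 1. Write the guard rails yourself: a failing test that pins down the requirement.
#    (tests/test_invoice_totals.py is a hypothetical file for illustration.)
pytest tests/test_invoice_totals.py   # should fail, and fail for the right reason

# 2. Ask the agent, in a tightly scoped prompt, to make exactly this test pass, e.g.:
#    "Implement the invoice total calculation so that tests/test_invoice_totals.py
#     passes. Do not touch any other tests or modules."

# 3. Verify the guard rails independently instead of trusting the agent's own summary.
pytest tests/test_invoice_totals.py
pytest                                # and make sure nothing else broke
```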
This also means that we will be committing and pushing our changes much more frequently than we are used to! As I described in my post about my git workflow, I strongly believe that each commit should comprise a work item which is small enough to comprehend. In the past, it probably took a minimum of 2–3 hours of programming to produce that amount of work, but with our AI agents we may well reach that point after 15 minutes!
When this happens, it is best to pause, polish, commit and push! Because the amount of work created is small enough for you to understand, it should also be small enough for your colleagues to review with ease.
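That checkpoint does not need to be elaborate. A rough sketch (the commit message is illustrative, and the test runner is assumed to be pytest as above):

```sh
# Pause: read the full diff yourself; if it is already too much to follow, split it first.
git diff

# Polish: run the tests and tidy anything that is not ready for a reviewer's eyes.
pytest

# Commit and push while the chunk is still small enough to review in one sitting.
git add .
git commit -m "Add per-line totals to the invoice view"   # illustrative message
git push
```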
The resulting code throughput will still be higher than in traditional software development without AI, but when the work is packaged into smaller chunks of code, it should be easier to bring those changes to production quickly, without the dangerous hand-waving “LGTM” reviews that happen when the scope is larger than the human brain can handle.
Do not course correct
Because of the nature of the system context, once an agent has begun to do things incorrectly or in a way that we do not like, it is very difficult to convince it to change course. We need to remember that the system context fed into the LLM includes the entirety of our session up to that point, so our next call to the LLM will include the entirety of the “faulty” implementation and THEN an instruction to “please don’t do it this way!” at the very end of the context.
No wonder that the agent has a difficult time letting go of its imperfect ways!
When we have broken our task down into sequential steps and are keeping an eye on the progress of our agent, and we notice that things are going off the rails, it is better not to course correct but to abandon ship: kill that session and start a new one with new, more detailed prompts about how best to implement the feature.
If in doubt, throw it out
Recently, I have been making “WIP” commits whenever my agent has successfully completed a small part of my task and I am satisfied with the work. This makes it easy for me to simply delete any code the agent has created which I do not like! If we restart a session with our agent instead of course correcting, but do not delete the code, that imperfect code we do not like will immediately land in the system context again and send the agent down the same wrong path.
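A sketch of that checkpoint-and-discard rhythm (the commit message is illustrative):

```sh
# Checkpoint: the agent finished a small step and I am happy with the result.
git add -A
git commit -m "WIP: extract invoice totals helper"   # illustrative message

# The next step went off the rails: throw the new code away completely,
# so it never lands back in the context of the fresh session.
git reset --hard    # drop all tracked changes made since the WIP commit
git clean -fd       # drop untracked files the agent created
```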
The sunk-cost fallacy can also come into play: we are primed to believe that existing code must be valuable (because we are used to having to invest our own time into writing it). This is less true today than it ever was before: if our agent was able to create that code for us in 10 minutes, it will easily be able to recreate a similar piece of code in the exact same amount of time. And this time, because we have learned from the mistake of our last instruction, we can write a prompt which will be more likely to produce results that we like.
In my example from before, I would likely have been able to achieve the same exact result by completely redoing the five tasks in five separate agentic programming sessions – potentially in less time, with less headache.
A new way of iterating
In a way, this method of taking smaller steps and iterating on our code feels very familiar: instead of writing a test and then iterating on a few lines of code to get that test to pass, we instead write a prompt for a short task and tweak or throw away the response until we are happy with our code.
This approach will allow us to achieve productivity gains by using AI without losing it again during the review process. As we continue learning how to use AI in our workflow, we will adjust to the new normal and learn how to commit and push quicker without losing steam. We owe it to ourselves (and our colleagues) to maintain our work in a small enough form that we are able to mentally comprehend it, without feeling like we need to sacrifice anything in the process.