This blog post is also available in German

AI agents write code. A lot of code. Fast. If you run multiple agents in parallel, you get thousands of lines of new code in just a few hours. Last week, one of my agents implemented a complete feature in an hour and a half – 4,500 lines, cleanly structured, merge request created. It just worked. Then I did some quick math: a thorough review would have taken me two, maybe three days.

This is the reality nobody likes to talk about: while I’m going through the first merge request, the agent is already producing the next 10,000 lines. If you want to read everything, you become the bottleneck. You’re slowing down exactly the productivity you were hoping to get from agents in the first place.

OpenAI demonstrated in an internal experiment what’s already possible: a team of three engineers built roughly one million lines of code in five months – without a single manually written line. 3.5 pull requests per engineer per day.[1] If you think you can handle this output with traditional code review, you haven’t done the math.

I don’t read everything anymore. I only look at the important things. But how do I decide what’s important? And above all: how do I take responsibility for code I haven’t fully read?

We’ve Seen This Before

But is this really a new problem? I’d argue: no.

A new team member joins a project, the original authors left long ago – and yet productive work continues. Responsibility is taken for new features, even though the new team member doesn’t know every line of the codebase. What makes this possible? Not the hope that the code will just work, but tangible things: a comprehensible architecture, a test suite that immediately flags when a change breaks something else, usable documentation, and a CI pipeline that catches errors. Only then is brownfield development possible at all.

Agents are like these new team members – except that every new session requires a fresh onboarding. This works best when they only need to understand small parts of the application to make changes. What helps with that are the same things as in brownfield: a clear structure, good tests, and comprehensible documentation.

Exactly this environment – tests, architecture checks, documentation, CI pipelines – is what makes the difference. In the world of agents, a term has emerged for it: the agent harness.

What Is an Agent Harness?

Charlie Guo puts it succinctly: an agent harness is “the set of constraints, tools, documentation, and feedback loops that keep an agent productive and on track”.[2] The formula behind it: Model + Harness = Agent.[3] A good model alone isn’t enough – what makes the difference in code quality is the harness.

The core philosophy behind it is radically practical: mechanical enforcement instead of hope. Mitchell Hashimoto, who coined the term “harness engineering”, puts the principle like this: “Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.”[4] Don’t trust that the agent will do better next time – instead, build feedback loops and deterministic checks that automatically catch the mistake next time around.

My conviction, which grows stronger with every week of practice: the harness is the most important part of the entire equation. In my experience, a solid model with a well-designed harness delivers better results than a top-tier model without guardrails. That’s why I invest the majority of my time not in better prompts, but in a better harness.

But what does a harness actually look like in practice?

What a Harness Looks Like in Practice

In my projects, the harness has evolved into a multi-layered safety net. Four layers that build on each other – each one catches what the previous one lets through.

Deterministic Guardrails – the First Line of Defense

Everything that can be checked automatically, I check automatically. On two levels: pre-commit hooks run the fast checks – unit tests, integration tests, architecture tests with ArchUnit, linting, and formatting. In the CI pipeline, the heavier artillery follows: E2E tests, security scans, static code analysis. The list grows with every mistake an agent makes – in the spirit of Hashimoto’s principle.

Pre-commit hooks are the critical mechanism. They are the gate: as long as these checks aren’t green, the agent cannot commit. No exceptions, no warnings – zero-warning tolerance. The agent hits the wall, gets the error message, and corrects itself.
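A minimal sketch of such a gate, written as a Python pre-commit hook. The Gradle commands in `CHECKS` are placeholders for illustration – substitute your project’s actual test runner, linter, and formatter:

```python
"""Sketch of a git pre-commit gate.

Install as .git/hooks/pre-commit (executable) and invoke
run_checks(CHECKS) as the entry point. The gradle commands below are
illustrative placeholders, not a prescribed toolchain.
"""
import subprocess
import sys

# Fast, deterministic checks; ordered cheap-to-expensive so failures
# surface early. The list grows with every mistake an agent makes.
CHECKS = [
    ["./gradlew", "test"],            # unit + integration tests (incl. ArchUnit rules)
    ["./gradlew", "checkstyleMain"],  # linting - zero warnings tolerated
    ["./gradlew", "spotlessCheck"],   # formatting
]


def run_checks(checks):
    """Run each command; stop at the first failure and return its exit code."""
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # A non-zero exit aborts the commit; the agent sees the
            # failing command and must fix the code before retrying.
            print(f"pre-commit: {' '.join(cmd)} failed - commit blocked",
                  file=sys.stderr)
            return result.returncode
    return 0  # all green: the commit may proceed


# As the hook entry point you would call: sys.exit(run_checks(CHECKS))
```

The key property is that the gate is binary: the hook exits non-zero on the first failed check, and git refuses the commit until everything passes.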

AI Reviews – the Second Look

A second agent reviews the first agent’s code. This sounds redundant at first, but in practice it’s surprisingly effective – precisely because the review agent has its own independent context. It doesn’t know the creation process, only the changes, the ticket, and the acceptance criteria. On this basis, it checks whether the merge request fulfills the requirements, looks for architecture violations, and finds code smells that static analysis doesn’t catch.
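The independent context is the point, so it is worth constructing it explicitly. A sketch of how the review agent’s input might be assembled – deliberately from the diff, the ticket, and the acceptance criteria only, never from the authoring agent’s session. The function name and prompt wording are my own illustration, not a fixed API:

```python
def build_review_prompt(diff, ticket, criteria):
    """Assemble the review agent's context from scratch.

    Includes only the change, the ticket, and the acceptance criteria,
    so the reviewer judges the merge request with independent eyes.
    """
    checklist = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are reviewing a merge request you did not write.\n\n"
        f"Ticket:\n{ticket}\n\n"
        f"Acceptance criteria:\n{checklist}\n\n"
        "Check whether the diff fulfills every criterion, look for "
        "architecture violations, and flag code smells that static "
        "analysis would not catch.\n\n"
        f"Diff:\n{diff}\n"
    )
```

Whatever model or CLI you then feed this prompt to, the discipline stays the same: the reviewer gets the artifacts, not the creation history.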

When both the deterministic checks and an AI review show green, my confidence increases significantly. This doesn’t fully replace human review. But it adds a layer that works fast and consistently.

Selective Human Review – the Important Parts

This is where the real shift happens: I don’t read everything anymore. I specifically look at the core business logic. Is the agent really doing what I expect? Is the domain decision correct? Does the code accurately represent the domain?

Boilerplate, mapping code, standard patterns – I leave that to the harness. When the first two layers show green, I don’t need to dig in here. Over time, you learn where you need to look and where you don’t. This intuition develops with every feature you build with agents.

Product Testing – Does It Actually Work?

At the end of the day, only one thing matters: does the software do what it’s supposed to? I test the feature, check the behavior, click through the application. Preview environments that are automatically spun up per merge request make this easy – a quick way to verify the result before it goes to production. All quality metrics can be perfect – if the feature doesn’t do what the user needs, it’s all worthless.

When the harness shows green, the AI reviews pass, the selective review brings no surprises, and the feature works – then I can ship. Not blindly, not naively, but with a confidence built on multiple independent layers of verification.

But this harness doesn’t run itself.

What the Harness Is Not: A License to Look Away

Just throwing agents at a legacy project and expecting a 10x boost – that’s a pipe dream. However, if you’ve already been practicing good software engineering – tests, clean architecture, documentation, independently testable components – you already have most of the harness and can integrate AI agents relatively easily. If, on the other hand, you’ve inherited a big ball of mud, you’ll need to invest first: build documentation, establish testability, create structure. This isn’t a weekend project.

My experience shows: with a well-designed harness, the nature of control shifts. It’s no longer about reading every line. It’s about reading the right parts – the core business logic, the domain decisions. The rest you leave to the harness. What matters is targeted attention: how likely is a bug? How severe is it? How easily does the harness detect it?[5] The more the harness catches automatically, the less I need to check manually.
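One way to make this triage explicit is a small scoring heuristic over the three questions above. The formula and scale here are my own illustration of the idea, not taken from the cited article:

```python
def review_priority(likelihood, severity, harness_detection):
    """Score in [0, 1]: how much human attention a change deserves.

    likelihood        - how likely a bug was introduced here (0..1)
    severity          - how bad such a bug would be in production (0..1)
    harness_detection - how likely the harness catches it automatically (0..1)
    """
    for v in (likelihood, severity, harness_detection):
        if not 0.0 <= v <= 1.0:
            raise ValueError("inputs must be in [0, 1]")
    # Risk that survives the harness: expected harm times escape probability.
    return likelihood * severity * (1.0 - harness_detection)


# Core domain logic: bugs are likely, costly, and invisible to generic
# checks - scores high. Boilerplate mapping code scores near zero.
core_logic = review_priority(0.6, 0.9, 0.2)
boilerplate = review_priority(0.1, 0.2, 0.9)
```

The numbers matter less than the shape of the reasoning: whatever the harness reliably detects drops out of the human’s queue, regardless of how much code it is.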

A well-built harness primarily catches structural problems and ensures that the code stays readable and manageable – for humans too. Because everything needs to be testable in isolation, I can look into individual parts at any time without needing to understand the big picture. What the harness doesn’t reliably catch: whether the feature actually does what it’s supposed to. This functional verification remains with the human.

The best litmus test: would you ship this if you were on call tonight?[6] If the answer is no, the harness isn’t strong enough – no matter how green the pipeline looks.


Whoever invests in their harness today can work at a speed tomorrow that wasn’t possible before – and still take responsibility for every feature they ship.


References

[1] OpenAI, “Harness engineering: leveraging Codex in an agent-first world”, February 11, 2026.

[2] Charlie Guo, “The Emerging ‘Harness Engineering’ Playbook”, Artificial Ignorance, February 22, 2026.

[3] Simon Willison, “How I think about Codex”, February 22, 2026.

[4] Mitchell Hashimoto, “My AI Adoption Journey”, February 5, 2026.

[5] Birgitta Böckeler, “To Vibe or Not to Vibe”, martinfowler.com, September 23, 2025.

[6] Birgitta Böckeler, “I Still Care About the Code”, martinfowler.com, July 9, 2025.