This blog post is also available in German
If you let agents work without tests, linting, or architecture constraints, you’ll quickly see where that leads. The code compiles, maybe even works, but nobody can say what happens when you change something. Structure erodes, boundaries blur, and after a few weeks you have a classic Big Ball of Mud, except this one was built in record time.
The real problem isn’t that agents write bad code. They write the code you allow them to write, and without guardrails, everything is allowed. That doesn’t scale, especially not at the speed agents bring to the table. Vibe Coding, generating code on demand without systematic control, feels like a huge productivity boost at first. But without direction, you’re mainly producing technical debt faster.
The question is actually simple: If I’m no longer reading every line, how do I still take ownership of the code? The answer is closer than you might think, because we’ve solved this problem before.
This Problem Isn’t New
Imagine you’re joining an existing project. The team has changed, the original developers are long gone, and you inherit 200,000+ lines of code. You don’t understand every line, not even every module. Yet you still ship features, fix bugs, and take responsibility. That’s normal brownfield work, and every one of us has done it.
Nobody would expect you to read and understand every line in a brownfield project before you’re allowed to open a pull request. Instead, you rely on systems: tests tell you if you broke something, linters enforce conventions, CI pipelines catch errors before they reach production. You understand the part you’re working on and trust that the rest is covered.
And that’s the key insight: I don’t need to know every line for ownership. I need to be confident that my change works and doesn’t break anything.
AI-generated code is essentially code from a colleague who’s no longer in the room, except this colleague produces significantly faster and significantly more. What makes brownfield possible in the first place is good modularization: loose coupling, high cohesion. Well-defined modules let you understand a small part and change it safely without keeping the entire system in your head. Agents benefit from this just as much: the better the module boundaries, the more focused and reliable their work becomes.
We already know how to own code we didn’t write. The only question is how to transfer that knowledge to working with agents.
From Spreadsheet to Lint Rule
Boris Cherny, known today as the creator of Claude Code, provides a good example in The Pragmatic Engineer podcast. At Meta, he was one of the most prolific code reviewers, and his method was surprisingly analog: a spreadsheet. Every time he left the same comment in a review, like “no any here please” or “error handling missing”, he added it to the list. When the same feedback appeared three or four times, he wrote a lint rule for it.
The principle behind it: repetitive human feedback becomes automated enforcement. Instead of saying the same thing in code reviews over and over, you build a system that says it for you. The reviewer becomes a system designer.
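The same move works in any stack. As a minimal sketch in Python (the rule itself is invented for illustration): a reviewer who keeps writing “error handling missing” could codify that comment as a small check that flags silently swallowed exceptions.

```python
import ast


def swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of `except` blocks whose body is only `pass`,
    i.e. places where errors are silently swallowed."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                lines.append(node.lineno)
    return lines
```

Wired into a linter plugin or a pre-commit script, the recurring review comment now enforces itself, and the reviewer never has to type it again.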
That this effort pays off is also measurable. An internal study at Meta showed that clean codebases have a double-digit percentage impact on engineering productivity. Half-finished migrations, inconsistent patterns, and outdated conventions don’t just slow down human developers, they confuse AI models just as much. An agent working in a codebase with three different error-handling patterns will reproduce all three. Consistency is the foundation for both humans and agents to work effectively.
Cherny’s spreadsheet method sounds almost trivial, but it contains a principle that takes on an entirely new dimension in the context of AI agents. More on that shortly.
The Agent Harness — More Than Just the Tool
If we take the principle of “turning recurring feedback into tooling” seriously, the question is: where does this tooling actually live? In which system do lint rules, tests, architecture checks, and prompts come together?
The answer is the agent harness. At its core, this is the agent program itself: Claude Code, Codex, or Cursor with its agentic loop. Through hooks, we can integrate directly and give the agent feedback on every action. But the harness also includes the agent workflow around it: the environment the agent works in. Through Git hooks and CI/CD pipelines, we can also intervene there and give the agent feedback before code is even merged.
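As a concrete illustration of the hook side: Claude Code reads hook definitions from its settings files. A sketch that re-runs a formatter after every file edit might look like the following (the exact schema can change between versions, and the `ruff format` command is an assumption for a Python project):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "ruff format ."
          }
        ]
      }
    ]
  }
}
```

The point is less the syntax than the placement: this feedback fires on every single agent action, before a commit even exists.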
The central question is: when I want to enforce a new constraint, at which integration point do I build it in? Two pyramids help with this decision, both inspired by the test pyramid.
The Tooling Pyramid covers the deterministic side. At the foundation are agent hooks, which run before or after tool calls. Because they execute on nearly every agent action, the integrated functions need to be extremely fast. Anything that takes longer slows the agent down on every single step. Above that are Git hooks (pre-commit, pre-push), which run less frequently and can therefore take a bit more time. Formatters, linters, or fast tests fit well here. At the top is CI/CD. That’s where checks go that take so long you wouldn’t want to run them locally as well, like comprehensive test suites or security scans. The principle is the same as with the test pyramid: what should always run must be fast. What’s slower belongs higher up.
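A minimal sketch of the middle tier, assuming a Python project (the tool names `ruff` and `pytest` are assumptions; substitute your own): a script invoked from `.git/hooks/pre-commit` that runs only the fast checks and blocks the commit on the first failure.

```python
import subprocess
import sys

# Fast checks only. Anything slower belongs in CI, not in a hook that runs
# on every commit. Tool names are assumptions; substitute your project's.
FAST_CHECKS = [
    ["ruff", "check", "."],
    ["ruff", "format", "--check", "."],
    ["pytest", "-q", "-m", "fast"],
]


def run_checks(checks: list[list[str]]) -> bool:
    """Run each check command; stop at the first failure to keep feedback fast."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-commit: {' '.join(cmd)} failed", file=sys.stderr)
            return False
    return True

# The actual hook entry point would end with:
#     sys.exit(0 if run_checks(FAST_CHECKS) else 1)
```

A nonzero exit status aborts the commit, so the agent gets the feedback in its loop instead of a human getting it in review.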
The Prompt Pyramid addresses through prompts everything that can’t be checked deterministically. The sorting criterion here is the degree of specialization: the more general a piece of information, the lower it belongs, because it’s relevant for every task. The more specific, the higher up, because it’s only needed in certain situations. At the base are CLAUDE.md and global rules, which are always loaded in the agent’s context. In the middle are conditional rules, which apply when the agent works in specific areas of the codebase, and skills, which are loaded when the agent performs a specific type of task. At the top are docs, specs, and ADRs, available only on-demand when the agent actively needs them. At its core, this is context engineering: the right information at the right time in context, without blowing the token budget.
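One way this can look on disk, as an illustrative sketch using Claude Code’s conventions at the time of writing (paths and loading mechanics depend on your agent tooling):

```text
CLAUDE.md                        # base: global rules, always in context
backend/CLAUDE.md                # conditional: loaded when working in backend/
.claude/skills/review/SKILL.md   # skill: loaded for a specific type of task
docs/adr/0007-layering.md        # on-demand: read only when the agent needs it
```

The lower a file sits in the pyramid, the more context budget it permanently occupies, which is exactly why only truly global rules belong at the base.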
The decision rule between the two pyramids: If you can tool it, tool it. If you can’t, prompt it. What can be checked deterministically belongs in the tooling pyramid. Architecture decisions, domain conventions, or style questions are too fuzzy for deterministic checks and belong in the prompt pyramid. Deterministic enforcement is generally preferable to prompts.
The Harness Is Never Done
An agent harness isn’t something you set up once and then forget. It grows with every PR you review.
The cycle starts unremarkably: the agent produces a pull request, and in the review a problem comes up, for example an import from a layer that shouldn’t have access, or a test without assertions. Up to this point, normal developer life. But now comes the crucial question: Is there tooling that could detect this problem automatically?
If yes, it gets integrated: maybe a new lint rule or an ArchUnit test. From then on, the agent runs against this new tooling next time and gets feedback not in the review, but immediately. If the check fails, it fixes the problem itself, before a human ever sees it.
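For Java, ArchUnit covers the layering case directly; in other stacks the same rule can be written by hand. A hedged Python sketch (the layer names `domain` and `infrastructure` are assumptions) that detects the forbidden import from the review:

```python
import ast
from pathlib import Path

FORBIDDEN_PREFIX = "infrastructure"  # the layer domain code must not touch


def forbidden_imports(source: str, prefix: str = FORBIDDEN_PREFIX) -> list[str]:
    """Return all imported module names in `source` that cross the boundary."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        violations += [name for name in names if name.startswith(prefix)]
    return violations


def check_domain_layer(root: Path) -> list[str]:
    """Collect violations for every Python file in the domain layer."""
    return [
        f"{path}: imports {name}"
        for path in (root / "domain").rglob("*.py")
        for name in forbidden_imports(path.read_text())
    ]
```

Run as a test or a pre-push hook, this turns the one-off review comment into a permanent boundary.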
This is an incremental process, but the effect is cumulative: every problem that’s codified as tooling once never appears again. After weeks, the harness has dozens of such checks. After months, it’s a dense safety net that reflects exactly the quality requirements of the project.
This is where the circle closes back to Boris Cherny. He tracked recurring review comments in a spreadsheet and turned them into lint rules — exactly the same principle. Except that with agents it takes on a new quality, because the agent doesn’t just produce the code, it also helps build the tooling. The spreadsheet was manual, the feedback loop with agents is a self-reinforcing cycle. With enough feedback and time, agents converge on the right solution, not through magic, but through systematic feedback, consistently applied.
What Humans Still Need to Own
Tooling makes code correct, but it doesn’t check whether the code does the right thing. Linters catch style problems, type checks prevent runtime errors, architecture tests secure boundaries. But no hook can answer whether the business logic is correct.
This is also the most subtle danger with agent-generated code: tests verify the agent’s assumptions. If the assumptions are wrong, the tests are still green. A concrete example: the agent implements a discount calculation and writes tests for it, but both are based on the same misunderstanding of the business requirement. The feedback loop doesn’t help here because there’s nothing to check against.
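A hedged illustration of that trap, with invented numbers: suppose the requirement said “10% off, capped at 20”, but the agent read it as a flat 10%. Implementation and test share the same misreading, so the suite stays green.

```python
def discounted_total(total: float, discount: float = 0.10) -> float:
    # Agent's assumption: a flat percentage discount with no upper limit.
    return round(total * (1 - discount), 2)


def test_discounted_total():
    # The test encodes the same assumption. Per the actual requirement
    # (discount capped at 20), 300.0 should yield 280.0, yet this passes.
    assert discounted_total(300.0) == 270.0
```

No amount of tooling catches this, because every check in the harness validates the code against the agent’s own understanding, not against the real requirement.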
That’s why business logic stays with humans. Not every line and not every file, but the places where domain decisions are made. This has consequences for architecture: if you want this review to work efficiently, you have to design for it. Isolate the domain layer, separate business logic from infrastructure. Not because it’s theoretically elegant, but because it focuses human review on what only a human can check. The rest is covered by tooling.
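What that separation can look like, as a minimal sketch with invented names: the domain decision is a pure function with no infrastructure imports, so a human can review exactly this file and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    total: float
    is_first_purchase: bool


def applicable_discount(order: Order) -> float:
    """Pure domain decision: the part a human must own and review."""
    return 0.10 if order.is_first_purchase else 0.0

# Infrastructure (database, HTTP, queues) calls into the domain, never the
# other way around, so human review stays focused on this small, pure core.
```

The design choice is deliberate: the smaller and purer the domain layer, the cheaper the one review step that cannot be automated.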
The Mindset Shift — From Vibe Coder to Agentic Engineer
Vibe coders generate code on demand, without any quality assurance, and hope the result works out. Agentic engineers work just as fast with agents, but they invest in their harness: constraints, feedback systems, deterministic checks. They can focus on domain logic because their systems cover the rest. What separates the two isn’t speed, but control. And whoever has that control can ultimately take ownership of their systems.