This blog post is also available in German.

TL;DR

  • AI productivity gains in software development are highly context-dependent — take vendor promises with a grain of salt
  • Greenfield + simple tasks = maximum gains (35–40%); legacy + complexity = next to nothing (0–10%)
  • Popular languages (Python, Java) benefit far more than niche ones (COBOL, etc.), thanks to a much larger pool of training data
  • With niche languages and complex tasks, AI can actually hurt productivity — hallucinations and faulty output become the norm
  • “Legacy code” and “niche language” are a toxic combination: together, they push you straight into the “AI Hell” quadrant with minimal to no productivity gains
  • Bottom line: if you’re working on a large legacy system in a niche language (which, let’s be honest, a lot of us do), think twice before leaning too heavily on AI tools

Where do I even start when trying to measure productivity gains from using Large Language Models (LLMs) in software development? In this short analysis, I simply draw on data from Yegor Denisov-Blanch’s talk Does AI Actually Boost Developer Productivity? (Stanford 100k Devs Study). In it, 136 teams from 27 countries were surveyed on whether they see productivity improvements from using AI (more precisely: LLM-assisted software development).

The following charts are relevant for my take on the “what really matters” factor; I’m repeating them here and adding my interpretation.

Chart I: Context Is the Brake

One of the most interesting insights from the talk, for me, is a 2x2 matrix that shows in which situations AI support actually adds productivity value for software developers. Instead of making blanket statements about AI productivity, the matrix breaks the question down along two dimensions: how mature the codebase is and how complex the task is. The results are more nuanced than the usual promises in glossy brochures (or on websites) from various AI tool vendors would have you believe.

2x2 matrix; explanation and conclusions in the text below
Productivity gains from AI usage by project maturity and task complexity

My interpretation

The matrix shows that productivity gains from AI are highest in greenfield projects with low task complexity—study participants report an uplift of 35–40% there. To me, the reason is obvious: low-complexity tasks are often repetitive and clearly defined, so AI can reliably generate boilerplate-heavy code with minimal risk of errors. Also, I suspect we’re in the realm of to-do list apps here: written a thousand times, and nothing new happens the thousand-and-first time.

However, the gains drop sharply as project maturity increases and/or task complexity rises (i.e., as soon as things get serious):

  • In brownfield and legacy projects, the gains fall to 15–20% even for simple maintenance tasks, because outdated code and complex dependencies limit what AI can contribute safely.
  • For highly complex tasks in systems that already resemble a Big Ball of Mud, the gains shrink to just 0–10%, because the AI struggles to cut through tangled architectures, poorly implemented ideas, and deeply nested logic.

This doesn’t surprise me: the underlying training data largely comes from public code repositories. There’s a clear bias in what gets shared—code you don’t have to be embarrassed about in public (at least that’s true for me). The real bulk of code that deviates from those idealized images sits inside companies’ closed software systems. An LLM’s first encounter with that kind of code can be jarring, which makes it harder to adapt known patterns from the training data to the existing codebase. Or, as Ludwig Wittgenstein put it more than a hundred years ago:

The limits of my language mean the limits of my world.

Even in ideal greenfield environments, highly complex work caps AI’s impact at 10–15%, because those tasks require deeper human judgment that mechanical automation can’t replace. AI can assist, but it still can’t replace the architectural thinking and contextual judgment that complex engineering and domain knowledge demand. That also ties back to the limited context capacity of today’s models (see my assessment in “Agentic Software Modernization: Chances and Traps”).

TL;DR: AI delivers the most when the problem is tightly scoped and the codebase is clean. High task complexity and legacy code are the two main productivity killers for AI—especially in combination (which is likely the reality for most of us).

Chart II: The Niche Penalty

The second chart shifts the perspective from project maturity to the choice of programming language. It turns out that the popularity of the language has a significant impact on how much an LLM can actually help—mainly driven by how much training data exists for that language.

2x2 matrix; explanation and conclusions in the text below
Impact of the programming language on AI-driven productivity gains

My interpretation

With widely used languages (e.g., Python, Java), LLMs provide the most benefit: for simple tasks, they boost productivity by 20–25% thanks to extensive training data (e.g., via reinforcement learning on thousands of simple question–answer pairs); for complex tasks, by 10–15%. In popular languages, LLMs can still provide solid support because they’ve seen huge amounts of diverse training data. But even in the best case, complex tasks still require human judgment—so AI acts more as an accelerator than a replacement.

With niche languages (e.g., COBOL—though to me that’s already mainstream), the gains for simple tasks are negligible at 0–5% (due to limited training data). For highly complex tasks, things get even worse: productivity can drop to as low as -5%, because the AI enters a hallucination-prone zone where it confidently produces plausible-sounding but incorrect output. This highlights that AI tools without sufficient training data can become a liability rather than an advantage in complex development work. Personally, I don’t see this changing for the better anytime soon. It’s also becoming clear that even actively soliciting code in niche programming languages doesn’t produce enough high-quality training data (and honestly: what insurance company wants to put its COBOL-written computational core on GitHub?).

The underlying driver in all four quadrants is the same: the more training data exists for a given language and task type, the more reliably AI can contribute. Language popularity is therefore not just a matter of personal preference—it’s a direct indicator of how productively you can use LLM-assisted software development.

Chart III: Heaven or Hell

For the third chart, I rather pragmatically combine the average productivity gains from the two previous 2x2 charts into a third perspective. It shows productivity gains split by programming language popularity and project maturity. I’m particularly interested in this view for a concrete reason: I sometimes work in projects that use programming languages that don’t even make it into the top 50 most popular languages in the TIOBE Index (https://www.tiobe.com/tiobe-index/)—and languages that will never show up there because they exist only within a single company. And of course, it’s worth mentioning: these are decades-old, massive software systems that are now slowly due for modernization.

2x2 matrix; explanation and conclusions in the text below
AI assistance productivity gain matrix

Note: This combined view is not a formally validated model. It’s a pragmatic thought experiment that connects two independent data sources by simply averaging them. It’s meant to provide orientation—not a precise prediction.
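The thought experiment behind the combined chart can be sketched in a few lines of Python. Note that the range midpoints and the combination rule (a plain average of the two axis effects, after collapsing the task-complexity dimension) are my assumptions about what “simply averaging” means here; the gain ranges are the figures quoted from Charts I and II:

```python
# A pragmatic reconstruction of the "Chart III" averaging (my reading of
# the approach, not the talk's methodology). Range midpoints and the
# plain-average combination rule are assumptions.

# Chart I: gain ranges (%) by project maturity x task complexity
chart1 = {
    "greenfield": {"simple": (35, 40), "complex": (10, 15)},
    "brownfield": {"simple": (15, 20), "complex": (0, 10)},
}

# Chart II: gain ranges (%) by language popularity x task complexity
chart2 = {
    "popular": {"simple": (20, 25), "complex": (10, 15)},
    "niche":   {"simple": (0, 5),   "complex": (-5, -5)},
}

def midpoint(rng):
    """Collapse a (low, high) range to its midpoint."""
    return sum(rng) / 2

def axis_average(chart, key):
    """Average the midpoints across the task-complexity dimension."""
    cells = chart[key].values()
    return sum(midpoint(r) for r in cells) / len(cells)

# Each combined quadrant is the plain average of its two axis effects.
for maturity in chart1:
    for popularity in chart2:
        gain = (axis_average(chart1, maturity)
                + axis_average(chart2, popularity)) / 2
        print(f"{maturity} x {popularity}: {gain:.1f}%")
```

Under these assumptions, the combined quadrants land at roughly 21% for greenfield + popular down to about 5% for brownfield + niche, which matches the ordering discussed in the interpretation that follows.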

My interpretation

When you combine both dimensions—project maturity (greenfield vs. brownfield) and programming language popularity—you get four interesting quadrants. The best-case scenario, “AI Heaven,” happens when you’re working on a greenfield project in a widely used language: that’s where you can expect the highest productivity gains. It’s the ideal state: ample training data meets a clean, unburdened codebase. AI can reach its full potential. That’s why vibe coding and prototyping with TypeScript and friends work so well.

In brownfield projects written in popular languages, gains drop noticeably. Now you’re paying the price for letting code hygiene best practices slide (Yegor Denisov-Blanch also has an excellent talk on this: “Can you prove AI ROI in Software Eng?”). A Large Language Model still understands the well-known programming language just fine, but the complexity and technical debt in the existing codebase limit what it can contribute.

Interestingly, niche languages in greenfield projects still deliver noticeable gains, only slightly worse than the legacy-code scenario in popular languages. That suggests a clean codebase can partly compensate for weaker training data, although the language barrier still sets a meaningful ceiling. My bias here is that it’s simply always easier to start from a blank slate, no matter which language you use. (I still remember when people kept saying “we’re just faster with Scala / F#,” which left me unimpressed even back then. Things get interesting once you have a mountain of code that goes beyond a to-do list.)

The worst-case scenario is “AI Hell”: a niche language combined with a brownfield codebase yields only minimal productivity gains. Here, both obstacles amplify each other. The AI has neither sufficient training data for the language nor the ability to meaningfully penetrate a tangled legacy codebase—the result is unreliable output and a high risk of doing more harm than good.

The key takeaway: language popularity and project maturity both matter in their own right. And their negative effects add up. In other words, each dimension already reduces AI productivity on its own; together, they push AI-driven productivity gains down to the lowest level. Teams working with niche languages in legacy systems should be especially cautious about relying too heavily on AI tools (see, for example, my article “Software Analytics going crAIzy!”).

PS: Did I mention I’m a fan of the Boring Software Manifesto and have been preaching for years that people should join it? I believe that in the age of agentic software modernization, the manifesto is more relevant than ever. 😉

If you’re interested in the charts: the accompanying Jupyter Notebook, which generated the images based on the talk’s data, is available here.

Header image sourced from Wikipedia, Creative Commons CC0 1.0 Universal Public Domain Dedication.