This blog post is also available in German
TL;DR
Experimentation culture is not a universal paradigm. It emerged in consumer internet companies with ad-revenue models. Most software projects do not exist in that environment.
Output is not outcome. A/B tests tell you whether a change moves a metric — not whether it is the right metric.
Metrics cannot replace user understanding. Implicit knowledge develops through observation and time, not structured interviews.
Invisible stakeholders are ignored. The people who bear the consequences of software decisions never appear in any A/B test.
Being there beats measuring. Direct experience and observation in the real context yield better understanding than testing.
Agentic development amplifies the problem. Faster iteration does not help when the underlying assumptions are wrong.
In my blog post “First Agile, Then Agentic”, I argued that AI amplifies existing organisational capabilities, and that faster experimentation is one of the potential benefits for well-positioned teams. The more quickly you can implement, the more experiments you can run in the same period of time, and the more you can learn. But faster experimentation is not automatically better. In many contexts, successful product development depends on entirely different factors.
Where experimentation culture comes from
It is worth taking a look at the environment in which the culture of experimentation actually emerged. It developed in the consumer internet companies of the 2000s and 2010s, where very specific conditions held: millions of daily users, a tight connection between product decisions and measurable outcomes, and a business model optimised for engagement and conversion. In that context, controlled experiments made sense. Statistical significance was achievable within days, and the metric being optimised — clicks, purchases, time on site — was closely tied to revenue. Lean Startup, growth hacking, and the DevOps movement all crystallised in this specific environment and were then packaged as universal methodology.
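The phrase "achievable within days" can be made concrete with a back-of-the-envelope calculation. The sketch below uses the textbook normal-approximation formula for comparing two conversion rates. The function names, the 5% baseline, the half-point lift, and both traffic figures are illustrative assumptions, not data from any real product.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline, lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift
    in a conversion rate (two-sided test, normal approximation)."""
    p1 = baseline
    p2 = baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(daily_users, baseline, lift):
    """Days until both variants have enough users, assuming a 50/50 split
    and that every daily user is a new, independent participant."""
    total_needed = 2 * users_per_variant(baseline, lift)
    return math.ceil(total_needed / daily_users)


if __name__ == "__main__":
    # Hypothetical scenario: 5% baseline conversion, hoping to reach 5.5%.
    print("users per variant:", users_per_variant(0.05, 0.005))
    print("consumer product, 1,000,000 users/day:", days_to_run(1_000_000, 0.05, 0.005), "day(s)")
    print("internal tool, 500 users/day:", days_to_run(500, 0.05, 0.005), "day(s)")
```

Under these assumptions the test needs on the order of 31,000 users per variant. A product with a million daily users collects that within a day. An internal tool with 500 daily users would need roughly four months for the same experiment.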
Today, experimentation culture is most at home wherever the same basic principle applies: social media, digital media offerings, free-to-play games — products whose business model is based on attention and ad revenue. In these contexts, users are not the customers. The advertisers are the customers, and the users are the product. Maximising engagement is not a proxy for user value — it is the actual goal. The tension between company outcomes and user outcomes simply does not arise.
Melissa Perri argues that good product work finds exactly this overlap: outcomes that create value for both the company and the users[1]. That is demanding work, and it requires understanding both sides. It also requires that users are the customers. Experimentation culture in the ad-revenue context has elegantly sidestepped this problem.
The problem is that most software projects do not exist in that environment.
Enterprise software and domain-specific applications have different success criteria, and a fundamentally different relationship between what can be measured and what actually matters. Perri calls the result the Build Trap: organisations that fixate on output metrics lose the thread between what they build and what users actually need. Features ship, velocity is high — but the software does not serve the real needs of the people who use it.
Output instead of outcome
Baldur Bjarnason describes in “Out of the Software Crisis”[2] what this looks like in practice:
“We decide on the problem without checking to make sure it’s a real problem for our end users. We then design without researching the nature and structure of the problem we’re trying to address. We ship without testing to see if it actually does the job it’s supposed to. Only then do we do some actual testing, often A/B tests. We throw two half-baked unfinished designs into a functional shipping application that people rely on to do their work and use Data™ to see which unmitigated disaster is marginally less disastrous for the working lives of those held hostage by our applications.”
The first sentence is the decisive one. Failing to check whether you are solving the right problem is output orientation in its purest form. You measure what is easy to measure, not what matters.
Teresa Torres makes the consequence clear: Continuous Discovery[3] is not about validating arbitrary hypotheses faster. It is about developing understanding before it is even clear what is worth testing. A/B tests tell you whether a specific change moves a specific metric. They do not tell you whether you are moving the right metric, or whether the metric connects to anything users actually care about.
Bjarnason, drawing on Deming[4], calls this tampering: reacting to symptoms as though they were causes. If the underlying assumption is wrong, A/B testing only optimises more thoroughly in the wrong direction. A/B tests are valuable when the conditions are right. Too often, however, they are used to replace genuine user understanding rather than complement it.
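Deming illustrated tampering with his funnel experiment: a marble is dropped through a funnel aimed at a target, and the operator either leaves the funnel alone or moves it after every drop to compensate for the last miss. The simulation below is a minimal sketch of that idea, not something taken from Bjarnason's book. The function name, the Gaussian noise model, and the parameter values are assumptions made for illustration.

```python
import random
from statistics import pvariance


def funnel_variance(n_drops, tamper, seed=0):
    """Simulate marble drops aimed at a target at position 0.

    Each drop lands at the funnel position plus random noise.
    If tamper is True, the funnel is moved after every drop by the
    negative of the last miss, i.e. the operator "corrects" for noise.
    Returns the variance of the landing positions.
    """
    rng = random.Random(seed)
    funnel_position = 0.0
    landings = []
    for _ in range(n_drops):
        landing = funnel_position + rng.gauss(0, 1)
        landings.append(landing)
        if tamper:
            # Compensate for the last miss relative to the target at 0.
            funnel_position -= landing
    return pvariance(landings)


if __name__ == "__main__":
    print("hands off:", round(funnel_variance(20_000, tamper=False), 2))
    print("tampering:", round(funnel_variance(20_000, tamper=True), 2))
```

When the scatter is purely random, compensating after every drop roughly doubles the variance of the landing positions compared with leaving the funnel alone. The adjustments treat noise as signal, which is the statistical core of tampering.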
Invisible stakeholders
High-frequency A/B testing without an adequate foundation can at best land a lucky hit. So how does one arrive at genuine user understanding?
In my blog post “Hail Mary”, I argued that domain knowledge is often not consciously accessible and is socially distributed — what Polanyi calls tacit knowledge. The same applies to user knowledge. Users are often unable to articulate what they really need, but they know it when they experience it. Metrics are an attempt to replace this implicit knowledge with behavioural data.
In “Continuous Discovery Habits”, Torres recommends regular interviews with users and explains how to ask questions that surface this implicit knowledge. The main pitfall is the general question, which tempts people to describe their generalised self-image rather than reflect on their actual behaviour. People are poor at knowing what they typically do, but good at recalling concrete episodes. Instead of asking “What criteria matter to you when choosing a restaurant?”, you should say: “Tell me about the last time you went out to eat with someone.”
Tools like BMAD do ask discovery-oriented questions — about the problem, the market environment, and the users. But they compress a process that requires time. The answers that emerge are what respondents can consciously retrieve in that moment, not what develops through observation, iteration, and incubation over time. And even if BMAD were to interview the right people, the question structure would systematically surface what users believe they do, not what they actually do. The implicit knowledge remains invisible.
Many organisations actively work against building this user understanding, deliberately shielding product teams and developers from the people who use their software.
And even when interviews with real users do take place, there is a further problem: software affects not only those who use it directly, but also people who never interact with the system at all. I call these people invisible stakeholders. They are almost never considered in design decisions.
An example of such an invisible stakeholder would be the person who stocks shelves at a supermarket. She probably has no access to the ordering software. But when that software miscalculates an order, she works overtime. When a new version changes how stock levels are displayed and the system behaves differently from what her manager expects, the confusion lands in the store — not in the product team’s retrospective. She is not a stakeholder in the conventional sense of the word. She is simply the one who bears the consequences. And whether the software serves her needs well or poorly will not be measured by any A/B test.
Being there instead of measuring
My fellow student Jörg Niesenhaus recently reflected in a LinkedIn post on his first weeks at ALDI DX. As part of his onboarding, he spent two weeks working in a store like any other member of staff: stocking shelves, operating the checkout, handling edge cases like broken freezer units and attempted theft. His friends at other IT companies asked why anyone would spend more than a day doing that. His answer: one day teaches you the basic processes. Two weeks teaches you the edge cases, the informal knowledge that colleagues share with each other, and above all how much work goes into selling a single yoghurt or cucumber. After seven years, he says, he can still trace decisions back to what he learned in that store.
This is not empathy. Empathy presupposes a distance. What ALDI DX built into their onboarding is more direct: knowledge that comes from being there and experiencing it yourself. You know what it means when the label printer jams because it jammed on you. You have experienced the cashier, the shelf stocker, the store manager who checks the order at six in the morning. You have felt firsthand what it means when the software does not handle an edge case and you have to improvise the workaround yourself.
Not every company can or wants to send its IT staff to work in a store for two weeks. But the underlying principle can be applied in a less intensive form. Ethnographic field research and its more accessible variant, contextual inquiry, do not require you to do the work yourself. Being there is enough. The researchers observe in the real context, ask questions while the work is happening, and see the workarounds that nobody documents and the edge cases that never appear in any interview. Jared M. Spool calls this principle exposure hours: systematic, regular presence with real users in real situations.
The hierarchy is roughly this: doing the work yourself is the strongest form, because you are an actor and not an observer. Being there and observing is the next best option — you are in the context, you see what actually happens. Regular presence without a formal framework is easier to organise and keeps the intuition alive. After that come short interviews with users, as Torres describes. And then, a long way behind: structured interviews and A/B tests. Experimentation culture too often operates exclusively at this last level, structurally maintaining the distance between the people who build software and the people whose working lives depend on it. Users become a source of behavioural signals — not people with a job to do.
Even more speed
There is a cognitive dimension to this debate that rarely gets discussed. ISO 9241-110, the international standard for interaction design principles, lists conformity with user expectations as one of seven fundamental principles. A system that runs continuous experiments on its own interface through agentically accelerated development is structurally incapable of meeting this principle. Users build mental models of software through repeated use. Every experiment that changes something resets part of that model. The cognitive load this creates is real, but it is generally invisible in the metrics used to evaluate whether an experiment succeeded. In addition, the higher the experimentation frequency, the more discipline is required to clean up failed experiments: dead code paths, orphaned feature flags, UI elements that belonged to a variant that lost.
Agentic development makes this more urgent. If experimentation culture was already a questionable fit for most enterprise contexts, tools that generate features faster and lower the cost of running more experiments do not fix the mismatch — they amplify it. BMAD promises requirements discovery in hours, with an agent interviewing the people who have access to the tool. But the knowledge that matters most in complex domains is implicit, socially distributed, and not accessible through structured interviews. The colleague who stocks the shelves will not be in that interview. Her knowledge will not appear in the specification.
The question is what you need to understand before experimentation becomes meaningful, and whether the current pressure towards speed enables that understanding or structurally prevents it. Users cannot build a stable relationship with software that changes continuously. Sometimes the wisest thing you can do is change nothing.