TL;DR

  • Boundaries: If a data product is cut too small or too big, people have to stitch data together, ownership gets unclear, logic is duplicated, and incidents become normal.
  • Consumers: A data product needs real users and use cases now; you should be able to describe its main purpose in one sentence, and data contracts help make expectations explicit.
  • Ownership: One team must be clearly responsible for meaning, data quality, and operations; without a stable owner, the data product should not exist.
  • Scope: Aim for the smallest useful standalone unit – easy to use without extra integration work, but not bloated with “maybe useful later” data.
  • Types: Use different cuts depending on the goal – source-aligned stays within one business domain, aggregates are only worth it when several teams need the same derived view and someone can govern it, and consumer-aligned is built around one decision/process for a specific audience.

Creating data products incorrectly causes predictable problems. Consumers end up stitching multiple datasets together just to answer simple questions. Ownership becomes unclear because no one is clearly responsible for semantics, quality, and changes. The same logic is implemented in several places, which is costly and usually inconsistent. Data products become bloated, which makes it difficult to find the right information and understand the content. As soon as data products become inconsistent and boundaries blur, operational disruptions occur, and incidents, backfills, and stakeholder escalations become routine.

The goal is the opposite. A well-cut data product is a stable, consumable package. Its semantics are clear enough that consumers do not need a translation layer. Its profile is coherent, meaning expectations regarding freshness, granularity and reliability are both realistic and achievable. In short, the data product is both self-sufficient and reliable.

Zhamak Dehghani’s wonderful book on Data Mesh is a great source of inspiration for me. Through my work as a consultant in this field and by discussing the matter with my colleagues, I have gained experience in putting this theory into practice. The aim of this text is to help you create successful data product cuts by offering practical heuristics. First, we will discuss the heuristics that apply to all data products, and then we will move on to the specialized heuristics for the three data product archetypes: source-aligned, aggregate, and consumer-aligned.

Fundamentals

Clearly defined consumer and use cases

One of the most important heuristics is consumer fit. A good cut lets consumers serve the most obvious use cases productively without doing their own integration work. The primary purpose of the data product should be expressible in a single sentence. Without use cases and a consumer or target audience, there is no data product.

To understand whether a data product is suitable for its consumers, requirements engineering is essential; a contract-first approach using data contract workshops maps this well. Data contracts are, in a sense, interface descriptions for data product output ports, and they naturally provide an accurate picture of the requirements for the underlying data product.
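As an illustration, a data contract for a single output port might capture schema and quality expectations as shown below. This is a minimal sketch in Python; all product, field, and threshold names are hypothetical, and real-world contracts are often expressed in a dedicated format such as YAML rather than code.

```python
# Minimal sketch of a data contract for one output port.
# All names, fields, and thresholds are hypothetical examples.
orders_contract = {
    "data_product": "checkout.orders",
    "output_port": "orders_daily_snapshot",
    "owner": "team-checkout",
    "schema": {
        "order_id": str,
        "customer_id": str,
        "total_amount": float,
    },
    "quality": {
        "freshness_hours": 24,    # snapshot must be at most one day old
        "completeness_min": 0.99, # at most 1% missing rows vs. source
    },
}

def violates_schema(record: dict, contract: dict) -> list[str]:
    """Return the field names that are missing or have the wrong type."""
    schema = contract["schema"]
    return [
        field
        for field, expected_type in schema.items()
        if not isinstance(record.get(field), expected_type)
    ]

# A record missing 'total_amount' violates the contract:
bad_record = {"order_id": "o-1", "customer_id": "c-9"}
print(violates_schema(bad_record, orders_contract))  # ['total_amount']
```

Because the contract names an owner, a schema, and quality targets in one place, it doubles as a requirements document for the data product behind the port.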

Stable ownership

The second heuristic is stable ownership. A data product must belong to a single domain or team that is responsible for semantics, quality, and operations. If a clear owner cannot be identified, the data product should not exist. This is because data products must be maintained and supported throughout their entire lifecycle to continue meeting consumer requirements.

There are a variety of triggers that result in the genesis of data products. The most important ones are:

  • A team publishes its data because it conducts its own analyses and assumes that the data could be of interest to others too.
  • There is a demand for specific data, so the relevant team is approached directly.
  • A manager or committee decides that a data product must be developed to meet a specific demand for data.

In the latter case, the issue of ownership is particularly important, because it must be clear who will bear the costs of developing and maintaining the data product.

Consistent data quality

The previously described data contracts can provide further insight into the data product cut. They contain information about data quality characteristics. These should be consistent across the output ports, because all output ports should draw on the same internal sources of the data product. However, this statement must be qualified with regard to time-dependent characteristics: a daily CSV export to S3 cannot be as up to date as a real-time data stream on Kafka.
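To make this qualification concrete, the same dataset can honor different, port-specific freshness expectations. The sketch below is illustrative only; the port names and SLO values are invented.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness expectations per output port of one data product.
# The underlying data is the same; only the delivery latency differs.
freshness_slo = {
    "kafka_stream": timedelta(minutes=5),  # near real-time expectation
    "s3_daily_csv": timedelta(hours=26),   # daily export plus some slack
}

def is_fresh(port: str, last_update: datetime, now: datetime) -> bool:
    """Check a port's last update against that port's own SLO."""
    return now - last_update <= freshness_slo[port]

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
three_hours_ago = now - timedelta(hours=3)
print(is_fresh("kafka_stream", three_hours_ago, now))  # False: stream is stale
print(is_fresh("s3_daily_csv", three_hours_ago, now))  # True: within daily SLO
```

The point is that "consistent quality" means one shared definition evaluated against per-port expectations, not identical numbers on every port.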

Low integration burden

A data product should minimize integration effort for its consumers. The boundary is more likely to be appropriate when the data product represents the smallest useful unit that does not require consumers to reconstruct the context by combining other datasets. Typically, consumers should be able to start using the data product on its own without building an additional integration layer. If most consumers immediately need to combine it with other products, then the scope is probably too narrow or lacks essential context.

However, we must ensure that certain central datasets do not become single points of failure. Some master data are candidates here, because joining them with other data almost always adds considerable value and is therefore done frequently. Integrating such a central data product into another one carries risks: cascading effects in the event of failure, operational complexity, inconsistencies, and lag. With this in mind, the standalone dataset may be the more valuable cut, as it can serve most use cases without enriched data while improving consistency, reliability, and ease of use.

Bounded scope

The scope of a data product should be limited and closely match its intended use. It should contain only what is needed for its intended use, not everything that might be useful in the future. Maintaining a narrow scope prevents semantic noise and stops the data product from becoming too big. If the data product begins to include loosely related data or speculative additions, then the scope is likely too broad.

At first glance, this may seem to contradict the principle of low integration barriers. In reality, adhering to both principles results in a stable, sustainable scope for data products, making them as user-friendly as possible while ensuring they are not burdened with unnecessary extras.

Figure: gradient scale from “High Integration Burden” to “Bloated Scope”; an arrow marks the “smallest useful standalone unit” in the middle.

General heuristics for cutting data products

Clearly defined consumer and use cases

  • Can you describe the main purpose in one sentence?
  • Are there any specific teams or roles that want to use this data product right now?

Stable ownership

  • Is one specific domain or team accountable for semantics, quality, and operations?
  • Would the owner credibly handle future changes?

Consistent data quality

  • Are data quality attributes consistent across output ports?

Low integration burden

  • Is this the smallest useful standalone unit that does not force consumers to stitch data products together?
  • Can a typical consumer immediately start using this data product meaningfully on their own?

Bounded scope

  • Does the data product include only what is needed for its purpose?
  • Is it limited to data that is useful now rather than speculative additions?

Source-aligned data products

Semantic coherence

Source-aligned data products form the foundation layer of most data landscapes. They provide consumable, domain-specific data that remains close to the operational truth. In practice, a source-aligned data product typically comprises a consistent set of data, along with all the local dimensions required for interpretation. The key idea is that consumers should receive a reliable, well-defined representation of a single topic without having to reverse engineer the source system.

Finding the right scope starts with the source, but not with the mindset of dumping everything you find in it. The objective is to identify the smallest stable package from the consumer’s perspective. This involves understanding whether the source data can be meaningfully divided, recognizing which elements only make sense together, and identifying a narrow unit that enables consumers to work without having to rebuild integrations. Adhering to these principles will prevent you from fragmenting sources into tiny datasets that only the producing team can comprehend.

Data from a single business domain

Rather than exposing entire technical systems, it is more practical to align with existing domain modules or microservices. Presenting your consumers a data product called “All data from SAP” is usually too broad, semantically messy, and difficult to manage. A domain module, on the other hand, has clearer responsibility boundaries and a natural owner. In Domain-Driven Design, a microservice is usually associated with a bounded context. Within this context, the domain model usually comprises one or more aggregates (not to be confused with the aggregate data product archetype). These Domain-Driven Design aggregates are excellent candidates for source-aligned data products.

In reality, it can be pragmatic for a team to create a data product connected to a large system that spans several subdomains, such as the aforementioned SAP, in order to save operating costs. However, this would likely violate the principle of domain ownership. It is better to find teams that can take ownership and break the data product down into more specific ones. Such a broad data product essentially faces the same challenges as the aggregate archetype presented below and should therefore be avoided.

If you want to look at the whole thing from a data architecture perspective, you can use modeling practices such as Kimball’s Dimensional Modeling Techniques. Kimball describes a method for dimensional modeling of data warehouse structures in which business processes are modeled as facts (measurable events) on a clearly defined granularity and contextualized by descriptive dimensions. Advanced modeling techniques such as Conformed Dimensions become interesting at a higher level (aggregate, consumer-aligned) or can even contradict data mesh principles. However, these business processes, facts, and dimensions can form a good basis for cutting data products, provided that data architects are familiar with at least the basics.
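To make the fact/dimension vocabulary concrete, here is a minimal star-schema sketch: one business process (order placed) modeled as a fact at line-item grain, contextualized by a product dimension. The tables and field names are invented for illustration; in practice this structure would live in a warehouse or data product, not in application code.

```python
# Hypothetical product dimension: descriptive context for the facts.
dim_product = {
    "p-1": {"name": "Laptop", "category": "Electronics"},
    "p-2": {"name": "Desk", "category": "Furniture"},
}

# Hypothetical fact table; grain: one row per order line (measurable event).
fact_order_lines = [
    {"order_id": "o-1", "product_id": "p-1", "quantity": 1, "revenue": 1200.0},
    {"order_id": "o-1", "product_id": "p-2", "quantity": 2, "revenue": 300.0},
    {"order_id": "o-2", "product_id": "p-1", "quantity": 1, "revenue": 1150.0},
]

def revenue_by_category(facts, products):
    """Aggregate the measurable event (revenue) along a dimension attribute."""
    totals: dict[str, float] = {}
    for row in facts:
        category = products[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["revenue"]
    return totals

print(revenue_by_category(fact_order_lines, dim_product))
# {'Electronics': 2350.0, 'Furniture': 300.0}
```

A fact with its clearly defined grain plus the locally owned dimensions needed to interpret it is exactly the kind of "smallest stable package" a source-aligned data product can offer.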

The same care is required when deciding which contextual information belongs in a data product. Context that is clearly owned, maintained, and evolved within the same domain should live inside the data product itself. Context that is shared across domains and must remain semantically consistent, such as customers or products, should be exposed as separate data products with their own ownership. Other products should reference these shared domain products rather than redefining them locally. This avoids semantic drift, where core business concepts slowly diverge and end up meaning different things in different parts of the data landscape.

Source data blast radius

From an operational perspective, routine changes to the source system should not affect many source-aligned data products. Ideally, the impact of changes to the source schema or logic is as localized as possible. This minimizes the effort required to maintain the data product while offering consumers insight into how the operational world works. Much of this is determined by good upstream design, so bear any known flaws in mind when designing the output ports of your data products. It is, of course, inevitable that changes to the quality of the source data will affect all data products that depend on it.

Special heuristics for source-aligned data products

Semantic coherence

  • Does the data product make sense on its own, or does it require the other parts of the source data?
  • Does it feel like a cohesive, integrated whole rather than a random collection of related items?

Data from a single business domain

  • Does the cut follow meaningful domain modules rather than whole systems?
  • Does the data contain only internal or also cross-domain context?

Source data blast radius

  • Do changes on the data source impact only this data product directly?

Aggregate data products

Value and reuse versus cost and complexity

Aggregate data products are built on top of source-aligned products. They combine or summarize data from multiple source-aligned products to provide semantics that can be used across an entire domain. Think of them as shared building blocks that eliminate the need for repeated integration and calculation work for many consumers. As they encode cross-source meaning, they require greater governance and coordination than source-aligned products.

Aggregates are special because determining ownership and cost allocation is more difficult. A source-aligned data product typically has an obvious owner, which is the domain operating the underlying system. Since the source data of aggregates often span several owners, aggregates should only be created when the shared value is clear, and responsibility can be assigned to a single team capable of maintaining the integrated semantics. The question with aggregates is often not how to cut them, but whether they should exist at all.

It is worth creating an aggregate when:

  • Several teams require the same derived view with the same meaning. A good rule of thumb is three or more teams. Allowing each team to build its own version would be costly and lead to inconsistent results.
  • The derivation itself is expensive. Typical examples include feature computation for machine learning, entity matching or deduplication. Reusing one high-quality derivation is cheaper than repeating it in multiple places.
  • The core value only emerges once sources have been combined. For example, no single source-aligned data product can provide a complete customer view, which only emerges once orders, payments, returns, and CRM signals are brought together.
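These criteria can be condensed into a quick go/no-go check. The sketch below simply encodes the rules of thumb from this section; the function and its parameters are illustrative, not a formal decision method.

```python
def aggregate_is_justified(
    consuming_teams: int,
    derivation_is_expensive: bool,
    value_emerges_only_when_combined: bool,
    has_single_owning_team: bool,
) -> bool:
    """Rough go/no-go check for creating an aggregate data product.

    Encodes the rules of thumb from the text: without a single team that
    can own the integrated semantics there is no aggregate, and at least
    one value criterion (broad reuse, expensive derivation, or value that
    only emerges from combined sources) must hold.
    """
    if not has_single_owning_team:
        return False
    return (
        consuming_teams >= 3  # rule of thumb: three or more teams
        or derivation_is_expensive
        or value_emerges_only_when_combined
    )

# Two teams wanting a cheap view they could build locally: no aggregate.
print(aggregate_is_justified(2, False, False, True))  # False
# A complete customer view that only emerges from combined sources: justified.
print(aggregate_is_justified(2, False, True, True))   # True
```

A failed check does not forbid the derivation; it just suggests keeping it closer to the consumers instead of centralizing it.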

Scope and governance

They are not useful if the only motivation is a desire for central control or if the shared use cases are actually heterogeneous and would lead to disagreements over definitions. In those cases, it is better to keep the derivations closer to the consumers and provide separate, consumer-aligned products if needed.

Although their size is driven by concrete use cases, aggregates still need a tight scope. The risk is turning an aggregate into a small warehouse that tries to answer every possible question, which increases its complexity and blast radius. Aggregates change slowly by nature because any semantic change affects many consumers, so careful governance and clear boundaries are required.

Special heuristics for aggregates

Value and reuse

  • Are there more than two teams that need the same derived view with identical meaning?
  • Would teams repeatedly build the same integration or calculation without the aggregate?
  • Does value emerge only after combining sources?

Cost and complexity

  • Is the derivation expensive (feature engineering, entity matching, deduplication, cross-source joins)?
  • Is there someone in the company willing to bear the costs of this data product?

Scope and governance

  • Is the scope tight enough so the data product is not drifting toward a mini data warehouse?
  • Is the outcome valuable enough to justify the required strong governance?
  • Can the owning team maintain the integrated semantics, despite spanning multiple sources?

Consumer-aligned data products

Clear purpose

Consumer-aligned data products are designed for a specific purpose and require minimal effort from consumers. As the name implies, and in contrast to source-aligned data products, the data is optimized for end users and sophisticated analyses. Rather than presenting domain reality in a general way, they provide precisely the data and semantics the consumer needs to make an informed decision, run a process, or power a specific analytical or operational artifact. Success is measured by whether a consumer can use the data product without first building a custom integration layer.

A quick test is to determine whether the data product serves one coherent purpose for a defined audience. The result may be a single report or a well-scoped data mart that supports a family of related reports. For instance, a finance data mart that supports “producing consistent margin and revenue reporting for the monthly close” clearly passes this test, even if it powers dozens of reports. The warning sign is not the number of questions answered but whether those questions belong to the same decision context. A data product framed as “supporting finance reporting, marketing analysis, ad hoc exploration, and future machine learning use cases” fails the test because it mixes unrelated audiences and responsibilities.

Natural, focused size

Consumer-aligned products tend to have a natural size, but that size is defined by the job to be done rather than by a single artifact. These products may correspond directly to a specific report or dashboard or support a family of closely related outputs, such as multiple views of the same dashboard. Other examples include reverse ETL flows into an operational tool or a machine learning model with its tailored feature set. This is a heuristic, not a hard rule, but it helps keep scope in check. If a data product feeds many different dashboards or models, it should probably be split into several products, each aligned to a specific consumer.

Meaningful boundaries

Their boundaries usually follow decision or process lines rather than system lines. A natural cut is a business moment at which a decision is made or an action is taken, for example a fraud analyst reviewing a case or a marketing lead deciding how to allocate budget. These cuts often contradict the structure of the source systems, but that is acceptable, since consumer-aligned products aim to optimize consumption rather than mirror the sources.

Business Consumers

Finally, they require clear business objectives and strict scope management. If you cannot identify the people or teams who will use the data product, there is a risk of creating a thinly disguised catch-all dataset. The data product should contain only what is needed to do the job, not everything that might be useful one day.

Special heuristics for consumer-aligned data products

Clear purpose

  • Can this purpose be expressed as a verb + object sentence (for example, monitor churn by segment, forecast demand by categories, or review fraud cases)?

Natural, focused size

  • Does the data product support a single decision context that results in one or more related dashboards, reports, or outputs?

Meaningful boundaries

  • Does the boundary follow a process and not a system boundary?
  • Does the cut reflect how a consumer acts or decides, not how data happens to be stored?

Business Consumers

  • Will your data product be used by business users, data analysts, data scientists, or applications?

Conclusion

Getting the size of a data product right is not a one-time decision; it is an ongoing balancing act. The heuristics presented here are not rigid rules, but guiding questions to help teams make informed choices. The same basic principles apply when shaping a source-aligned product close to the operational truth, justifying the cost of an aggregate, or tailoring a consumer-aligned product to a specific decision context: clear ownership, a defined scope, and genuine consumer value. Although no heuristic can replace conversations with the people who will build, maintain, and use the data product, a shared vocabulary for these discussions can significantly improve outcomes.

What do you think? Do you have similar ideas about data product cuts? Is anything missing? Let’s discuss it!