
Valuable information can be found in the data of operational IT systems. It allows conclusions to be drawn about user behavior and can help us better understand the software system. However, software developers often neglect data analysis in their projects. For many project managers and product owners, data play a subordinate role in the evaluation of features and user stories, primarily because they are rarely available in analytically prepared form.

Traditionally, data are collected and analyzed by data teams in data warehouse systems and data lakes. These evaluations, however, usually focus on corporate management and marketing and rarely on the teams that develop new features. Moreover, the results are often disappointing, typically because of inadequate data quality in the connected source systems and a lack of domain expertise for interpreting the data within centrally organized data teams. In addition, such centralized systems and teams do not scale with the constantly growing demand for data analysis.

Data Mesh is a relatively new, decentralized data architecture approach designed to enable development teams to independently perform cross-domain data analysis to better understand the behavior of their users and systems. Making curated data available to other teams creates a valuable decentralized data network.

Modularization also in the Data Architecture

Significant advances have been made in software development in recent years. Strategic domain-driven design (DDD) helps to structure a system's business functionality into domains. Clear responsibilities for individual domains can be established, and the sociotechnical relationships and interfaces can be identified and described using context mapping, for example.

Autonomous product teams in the organizational structure take responsibility for exactly one clearly delimited domain and in return are given a high degree of freedom, from the choice of programming languages to team staffing. As a result, these teams build software systems that cover the business scope of their bounded context. For this purpose, they use modular software architecture approaches such as microservices (in the sense formulated by Sam Newman back in 2015) or self-contained systems in order to avoid monolithic systems and to best meet quality goals such as reliability and maintainability. Interaction with other domains takes place via clearly defined interfaces, asynchronously where possible, for example as domain events via Apache Kafka or HTTP feeds.

The splitting of monolithic structures into separate business domains can now also be applied to data architectures, following the Data Mesh principle. The prerequisite is that teams are divided along domain boundaries. Instead of a central data warehouse that brings together all data and is managed by a single data team, domain teams perform analytical evaluations for their specific domain themselves and access data sets from other domains via clearly defined interfaces.

The Principles of Data Mesh

The term “Data Mesh” was coined in 2019 by Zhamak Dehghani in her article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”. The definition usually encompasses the following four principles:

Domain ownership means that the teams responsible for the operational systems also provide their analytical data. The core idea is that data only have a clearly defined meaning within their bounded context and that the respective teams know their own data best. Each team therefore decides for itself which data are important and how they are prepared for analytical purposes, and takes full responsibility for its data.

The four Data Mesh principles (Fig. 1).

According to the principle of data as a product, analytically relevant data should be managed and stored in such a way that other teams can easily access them. The data are subject to the team's usual development process, from the description of a user story by the product owner and comprehensible documentation on data access and the meaning of the individual fields through to operational responsibility, which includes monitoring. With its data, the team makes a valuable contribution to other teams.

The self-service data platform describes a data platform that is used by teams to provide analytical data, data evaluations, and visualizations, as well as for easy access to data from other domains. The focus is on the self-service character: All activities should be possible without the need for another team to intervene.

Federated governance refers to cross-domain agreements that regulate how interaction can be designed efficiently and how the quality of the data landscape can be guaranteed in the long term. To this end, representatives of the domains jointly define the global policies (comparable to macro-architecture guidelines) that are necessary for interoperability and security. At a more advanced stage, compliance with these rules can also be ensured automatically by the platform.

These principles are appropriate for describing responsibilities in decentralized data architectures. But what do they mean for software developers in the domain teams?

Domain-Specific Data Analysis with a View to the Overall Result

From the perspective of the domain teams, the self-service data platform is particularly exciting at first. Until now, developers of operational systems have usually not performed data analysis, because it generally cannot be done efficiently on operational databases and affects the performance of the production application. The provision of an easy-to-use data platform, however, now enables teams to prepare data analytically and perform data analysis on their own.

Data Mesh architecture (Fig. 2).

A data platform is typically managed by a dedicated data platform team and encompasses storage, data ingestion, processing, analysis, visualization, and the management of metadata and permissions. A data catalog is used to locate and document data products – usually a combination of data, an AI model, and the presentation of the results – and to comply with agreed guidelines.

With the described data platform, the teams are now able to provide data for analytical purposes. Once the data have been transferred to the platform, they still need to be prepared and cleaned. For this purpose, it is advisable to first import data in the source format (e.g., JSON) into a CLOB (character large object) field so that the database schema does not have to be considered yet. In the next step, developers should convert the data into a structured SQL table format, deduplicate them, and, if necessary, handle structural changes and null values. In view of data protection requirements, they should also remove or at least anonymize sensitive information (personal data, credit card information) as early as possible, preferably before storing it in the platform. This cleaning can be done much more effectively by domain teams that know their operational data well than by central data teams that lack the domain context. Such focused data preparation contributes to a higher level of data quality.
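Such early anonymization can be expressed directly in SQL. The following sketch assumes a hypothetical raw table datameshexample-checkout.raw.orders whose JSON payload contains an e-mail address; the address is replaced by a salted hash before the record reaches a cleansed area (BigQuery syntax, all names are made up):

-- Hypothetical example: pseudonymize personal data before further processing
CREATE OR REPLACE TABLE `datameshexample-checkout.cleansed.orders` AS
SELECT
    JSON_VALUE(data, "$.order_id")                                     AS order_id,
    -- replace the e-mail address with a salted SHA-256 hash so that analyses
    -- can still count distinct customers without exposing personal data
    TO_HEX(SHA256(CONCAT("static-salt", JSON_VALUE(data, "$.email")))) AS customer_pseudonym,
    CAST(JSON_VALUE(data, "$.total") AS NUMERIC)                       AS total,
    CAST(JSON_VALUE(data, "$.ordered_at") AS TIMESTAMP)                AS ordered_at
FROM `datameshexample-checkout.raw.orders`;

Note that a static salt only pseudonymizes the data; depending on the data protection requirements, the field may have to be removed entirely.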

With the cleaned data, the teams can perform analyses on their own. Initially, these are usually simple SQL queries. Using JOIN operations, data records from different source systems (for example, different microservices) can be merged and consolidated using aggregation functions. Window functions are particularly helpful for evaluations across multiple rows within a partition. Visualizations help to identify trends and anomalies. Domain teams can use their own data to explain past system performance and track user behavior. This is especially helpful when it comes to deriving new features from the findings and evaluating the benefits of user stories early on, rather than prioritizing purely on gut feeling. It also becomes easier to identify system errors, such as those that always occur shortly after a particular event. With Data Mesh, a domain team can perform such evaluations independently and incorporate the insights into operational improvements more quickly.
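As an illustration of such an internal analysis, the following sketch joins two hypothetical cleansed tables of a fulfillment domain and uses a window function to compare each order's processing time with the average of its warehouse location (BigQuery syntax; table and column names are assumptions):

-- Hypothetical example: join order and shipment data and use a window function
-- to compare each order's processing time with the average of its location
SELECT
    o.order_id,
    s.location,
    TIMESTAMP_DIFF(s.shipped_at, o.ordered_at, HOUR) AS processing_hours,
    AVG(TIMESTAMP_DIFF(s.shipped_at, o.ordered_at, HOUR))
        OVER (PARTITION BY s.location)               AS avg_processing_hours_at_location
FROM `datameshexample-fulfillment.cleansed.orders`    AS o
JOIN `datameshexample-fulfillment.cleansed.shipments` AS s
    ON s.order_id = o.order_id
ORDER BY processing_hours DESC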

What is still missing is an overview of the overall result, for example the effect of a UI change on a web store's home page on a subsequent purchase. Another example from e-commerce is the conversion rate, which measures the number of purchases actually made. Interesting, but initially ignored, is the question of whether customers returned an ordered item. This is where the networking of data from different domains comes into play: If the purchase-completion domain makes the result of a session available as a data set on the platform, and the product-returns domain does the same for returned items, correlations, and with A/B testing even meaningful causal relationships, can be identified immediately. This approach corresponds to the Data Mesh principle of data as a product: a domain provides certain relevant data sets in an accessible and documented form. The use and linking of cross-domain data thus leads to increasingly meaningful analyses.
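A cross-domain evaluation of this kind could look like the following sketch, which joins a purchases data product from the checkout domain with a returned-items data product from the returns domain to compare return rates per A/B test variant (all project, dataset, and column names are hypothetical):

-- Hypothetical example: link data products from two domains to see how often
-- purchases end in a return, broken down by A/B test variant
SELECT
    p.ab_test_variant,
    COUNT(*)                                               AS purchases,
    COUNTIF(r.return_id IS NOT NULL)                       AS purchases_with_return,
    ROUND(COUNTIF(r.return_id IS NOT NULL) / COUNT(*), 3)  AS return_rate
FROM `datameshexample-checkout.dataproducts.purchases`          AS p
LEFT JOIN `datameshexample-returns.dataproducts.returned_items` AS r
    ON r.order_id = p.order_id
GROUP BY p.ab_test_variant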

However, when linking data across domains, the problem arises that the data and terms of a domain are only valid within its bounded context. This is where the context mapping method from domain-driven design helps to describe the relationships between the domain models. A data set that follows the open host service pattern, for example, must be described more precisely than a data set that was created as a result of a bilateral customer/supplier agreement.

Joint agreements help to ensure technical and semantic interoperability, for example by all parties involved agreeing on a format (SQL, JSON, etc.), a language (German, English, etc.), and the designation and form of domain keys. Representatives of the domain teams and the platform team reach the necessary agreements together and document them as global policies.

A data product includes metadata in addition to the actual data set. Like an API, data products are also continuously monitored (Fig. 3).

Data as a product goes even further: If a data set is made available to other teams, it can be compared with a production API. Accordingly, the team must continuously ensure the availability and quality of the data, and all data sets must be clearly documented and findable. Changes to the data model, however, now become much more difficult. Monitoring is necessary to oversee the availability, quality, and completeness of the data. Analytical data thus join the ranks of a domain's interfaces, alongside the UI and APIs.
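Such monitoring does not require heavyweight tooling at first. A scheduled query along the lines of the following sketch could check freshness and completeness and trigger an alert when thresholds are violated; the table name is an assumption, matching the inventory example used later in this article:

-- Hypothetical example: a simple freshness and completeness check for a data product
SELECT
    COUNT(*)                                                      AS row_count,
    COUNTIF(sku IS NULL OR location IS NULL)                      AS rows_with_missing_keys,
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE)  AS minutes_since_last_update
FROM `datameshexample-fulfillment.dataproducts.inventory`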

The product owner is responsible for ensuring that the provision of data remains economically viable from a domain and business perspective. Since it is usually not efficient to make all domain data available as a data product for other teams (even if the teams using the data may wish this), a clear decision must be made: Which internal data are only required for internal analysis purposes, and which should be available as a data product to other teams? For example, domain events and master data that are frequently referenced should as a rule also be available to other domain teams. If in doubt, the teams must consult each other and agree on the required data products.

Data Mesh in Practice

Data Mesh works best as a bottom-up approach with a jointly defined goal, but how do you actually put it into practice? Ideally, it is started by one or (preferably) more teams that express an interest in data analysis. They agree on a data platform; cloud services such as Google BigQuery, AWS S3, Azure Synapse Analytics, or Snowflake are common nowadays. For getting started, however, a PostgreSQL database combined with a visualization tool like Metabase or Redash can also suffice. Each team is given its own area (project/account/workspace, etc.) in which it independently creates resources such as databases and tables. A logical structure defines areas in which the different types of analytical data should be located (see Fig. 4).

Example of a logical information architecture in Data Mesh (Fig. 4).
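On BigQuery, for example, such a logical structure could be created as separate datasets within the team's project. The area names below (raw, cleansed, dataproducts) are illustrative; in practice they would follow whatever structure the organization agrees on, such as the one shown in Figure 4:

-- Hypothetical example: each team creates the agreed areas in its own project,
-- here as BigQuery datasets (the names are illustrative)
CREATE SCHEMA IF NOT EXISTS `datameshexample-fulfillment.raw`;          -- unprocessed source data
CREATE SCHEMA IF NOT EXISTS `datameshexample-fulfillment.cleansed`;     -- deduplicated, structured data
CREATE SCHEMA IF NOT EXISTS `datameshexample-fulfillment.dataproducts`; -- curated data for other teams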

In the next step, the team feeds data from the operational system into the data platform. These usually end up unstructured in a raw area. Integrations such as Kafka Connect can be used for this purpose, or the team implements its own consumer that calls the streaming ingestion API of the respective platform. As a rule, domain events, if available, are well suited as the primary analytical data basis, as they represent facts about the business processes. Database exports via ETL batches, on the other hand, should be avoided where possible in order to enable real-time analyses without exposing the operational database schema.
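The target of such an ingestion can be kept deliberately simple. The following sketch shows a raw table matching the columns used in the cleaning listing below (id, time, data); it is an assumption about how the raw area might be structured, with the event payload stored as an unparsed JSON string:

-- Hypothetical example: a raw table into which a Kafka Connect sink or a
-- custom consumer writes domain events unchanged; the payload remains a JSON
-- string so that no schema has to be defined at this point
CREATE TABLE IF NOT EXISTS `datameshexample-fulfillment.raw.inventory` (
    id   STRING,     -- event key, e.g. the aggregate ID
    time TIMESTAMP,  -- time at which the event was ingested
    data STRING      -- the domain event as a raw JSON string
);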

The data are then cleaned, for example with an SQL query on the raw data that is stored as a view, as the following listing shows. Duplicates are removed and the JSON document is mapped to individual fields. Common table expressions are well suited for structuring such queries.

-- Step 1: Deduplicate
WITH inventory_deduplicated AS (
    SELECT * EXCEPT (row_number)
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS row_number
        FROM `datameshexample-fulfillment.raw.inventory`
    )
    WHERE row_number = 1
),
-- Step 2: Parse JSON to columns
inventory_parsed AS (
    SELECT
        JSON_VALUE(data, "$.sku")                           AS sku,
        JSON_VALUE(data, "$.location")                      AS location,
        CAST(JSON_VALUE(data, "$.available") AS INT64)      AS available,
        CAST(JSON_VALUE(data, "$.updated_at") AS TIMESTAMP) AS updated_at
    FROM inventory_deduplicated
)
-- Step 3: Actual query
SELECT sku, location, available, updated_at
FROM inventory_parsed
ORDER BY sku, location, updated_at

The cleaned data sets form the basis for the team's internal data analysis. It is usually necessary to distinguish between immutable events and master data entities that change over time. The teams can now analyze these data using SQL queries. Jupyter Notebook is often used to build up and document exploratory findings.

Jupyter Notebook is recommended for explorative data analysis (Fig. 5).

Visualizations help to make data easier for people to understand. The data platform should provide a suitable tool for this purpose – Figure 6 shows Google Data Studio as an example. A prerequisite for visualization is access to the tables or the corresponding queries. The aggregation of the data, on the other hand, can also be controlled directly in the tool.

Example of a Google Data Studio bar chart (Fig. 6).
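A chart such as the one in Figure 6 is typically backed by a simple aggregation query. The following sketch assumes that the cleaning query shown above has been stored as a view named datameshexample-fulfillment.cleansed.inventory (the name is an assumption):

-- Hypothetical example: an aggregation that could back a bar chart,
-- here the available stock per warehouse location
SELECT
    location,
    SUM(available) AS total_available
FROM `datameshexample-fulfillment.cleansed.inventory`
GROUP BY location
ORDER BY total_available DESC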

Finally, data should also be made available to other teams as data products. It is advisable to use a view here to provide a consistent view of the data externally, even if the underlying data sets change; such migrations can then be handled in the view's query. Other teams are then granted permission for this view and can access it via SQL queries. In addition, the data set must be documented according to the agreed procedures. In the simplest case, this can be done in a wiki or a Git repository. At a more advanced stage, however, a data catalog should be used that provides metadata about the data set and its quality, and in which the individual fields are precisely documented (Fig. 7).

Data products can be documented in a data catalog, for example (Fig. 7).
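On BigQuery, publishing such a data product could look like the following sketch: a view with a stable structure plus a read grant for the consuming team. The dataset, view, and group names are assumptions:

-- Hypothetical example: expose the data product as a view with a stable,
-- documented structure
CREATE OR REPLACE VIEW `datameshexample-fulfillment.dataproducts.inventory` AS
SELECT sku, location, available, updated_at
FROM `datameshexample-fulfillment.cleansed.inventory`;

-- Grant another team read access, here at the level of the data products dataset
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `datameshexample-fulfillment.dataproducts`
TO "group:team-checkout@example.com";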

Over time, this creates a mesh of data products from different domains that can be used across the board. Ideally, the Data Mesh encourages other teams to also use the platform with the data provided and ultimately to provide analytical data themselves.

The approach described here is pragmatic for software developers in that it is possible to work with SQL, a tool most IT specialists should be familiar with. On the other hand, handling data for visualization, as well as statistical methods, may still have to be learned. More complex tools for data preparation such as Apache Spark, Apache Beam, or NumPy are not strictly necessary, at least at the outset.

Data Mesh as an Innovation Driver for New Domains

Data Mesh is primarily about domain teams being able to perform data analysis on their own. But what happens to the central data teams that already exist in many organizations?

Thanks to its expertise, the existing data team is predestined to operate and manage the data platform described above; even with self-service cloud services, permissions still need to be set up and costs kept under control. The team can also act as an enabler and ambassador, encouraging and helping other domain teams to use the data platform. By providing templates and best practices for common use cases, it makes the platform attractive. Its experts can also move into the domain teams in an advisory capacity for limited periods of time in order to build up the necessary skills.

Existing data products such as reports, for example for corporate management, must continue to be provided. This task initially remains with the existing data engineers. However, their work becomes much easier: If domain teams provide cleaned data sets in a clearly agreed format and in consistently high quality, tedious data extraction and preparation steps are eliminated. Consideration should be given to establishing separate domain teams for data-intensive domains as well, such as for corporate management or marketing. Specialist domains such as customer profiling or sales predictions are also conceivable for data science teams, which provide both operational and analytical services based on data evaluations and machine learning models.

Conclusion

Data Mesh Contributes to Higher Quality and Scalability

Decentralization also makes sense for data architectures, provided the prerequisites are met: development teams must have clear responsibility for a business domain and a corresponding scope of action. Data Mesh primarily solves the problems of scaling and poor data quality by enabling operational teams to perform analyses independently and to offer curated analytical data products. By linking data from different domains, informed decisions can be made about the development of operational systems, and new, innovative services can be created.