Summary of the "Data Mesh in Practice" book
This article summarizes the book "Data Mesh in Practice - How to Set Up a Data-Driven Organization". The book can be freely downloaded here. There is also a podcast on Data Mesh Radio with one of the authors. The book is a report of roughly 60 pages on how to operate the data mesh architecture within an enterprise. It is divided into two main parts. The first part describes the data mesh architecture and the reason why it exists. The second part shows how the data mesh journey can be carried out within an enterprise.
Part 1: What Is Data Mesh and Why Do We Need It?
Pain Points of Centralized Data Responsibility
The data warehouse approach is to create a centralized, reliable repository that serves as a single source of truth. This approach does not scale, and it is impossible to provide a high-quality source of truth from a dynamic system. The data lake approach has similar issues. All data is loaded into the data lake, and the actual quality control happens at read time (schema-on-read). Both approaches rely on the responsibility of a central data infrastructure team. Such teams are usually detached from the use cases and from the domain knowledge of the source systems.
The Pillars of Data Mesh
Data mesh is not another iteration of a centralized data analytics architecture (such as data warehouse, data lake, data lakehouse, ...). It can be considered a paradigm shift and is built on four pillars:
Decentralized domain ownership
Business topics (customer, sales, orders, ...) or technical topics (click streams) that are complex and important enough form a domain. Such a domain should define the boundaries of ownership. Decentralized ownership of a domain by domain experts allows for scaling (new sources, changes, ...).
Data as a product
Analytical data is treated with a product-thinking mindset. The data product must be continuously aligned with feedback from its consumers. A data product can start as a minimum viable product. The internal workings of a data product are up to the team. A data product provides data in the form of an SQL API, storage access or similar. A data product is not intended for end users. A data product is an architectural quantum, i.e., a basic building block of the data mesh. The success of a data product should be measured (e.g., with usage metrics), and these measurements should guide the product lifecycle.
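As a rough illustration of measuring a data product through usage metrics, the following sketch tracks read accesses per consumer and derives a simple lifecycle hint. All names (`DataProduct`, `record_query`, `lifecycle_hint`) and the threshold rule are assumptions made for this example; they are not taken from the book.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor of a data product (illustrative names only)."""
    name: str
    owner_team: str
    output_port: str  # e.g. "sql" or "object-storage"
    queries_by_consumer: Counter = field(default_factory=Counter)

    def record_query(self, consumer: str) -> None:
        # Count one read access by a consuming team.
        self.queries_by_consumer[consumer] += 1

    def usage_summary(self) -> dict:
        # Two simple usage metrics: total reads and number of distinct consumers.
        return {
            "total_queries": sum(self.queries_by_consumer.values()),
            "distinct_consumers": len(self.queries_by_consumer),
        }


def lifecycle_hint(product: DataProduct, min_consumers: int = 1) -> str:
    # Assumed rule of thumb: a product with too few consumers is flagged for review.
    if product.usage_summary()["distinct_consumers"] < min_consumers:
        return "review"
    return "keep"


orders = DataProduct(name="orders", owner_team="checkout", output_port="sql")
orders.record_query("marketing")
orders.record_query("marketing")
orders.record_query("finance")

print(orders.usage_summary())  # {'total_queries': 3, 'distinct_consumers': 2}
print(lifecycle_hint(orders))  # keep
```

In a real platform these counts would more likely come from query logs than from explicit calls, but the idea of letting measured usage drive lifecycle decisions stays the same.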
Self-service data infrastructure platform
A data mesh is built on top of a self-service infrastructure platform. Such a platform should provide the tools and services that allow domain teams to easily build data products. The platform should be centrally managed, but it should be domain-agnostic and provide no domain-specific tool support.
Federated computational data governance
Centralized data governance would be a bottleneck. However, global data governance is still required, e.g., for legal compliance. The data platform should ensure that such data governance rules are enforced computationally.
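One way to picture "computational" governance is a set of global rules expressed as functions that the platform runs against every data product specification. The sketch below is a minimal example under that assumption; the `ProductSpec` structure and the two policies are hypothetical and only stand in for whatever rules an organization actually agrees on.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    is_pii: bool = False
    encrypted: bool = False


@dataclass(frozen=True)
class ProductSpec:
    """Hypothetical, domain-agnostic description of a data product."""
    name: str
    owner_team: str
    columns: Tuple[Column, ...]


# A policy takes a spec and returns a list of human-readable violations.
Policy = Callable[[ProductSpec], List[str]]


def must_have_owner(spec: ProductSpec) -> List[str]:
    # Global rule (assumed): every data product declares an owning team.
    return [] if spec.owner_team else [f"{spec.name}: no owner team declared"]


def pii_must_be_encrypted(spec: ProductSpec) -> List[str]:
    # Global rule (assumed): columns flagged as PII must be encrypted.
    return [
        f"{spec.name}.{col.name}: PII column is not encrypted"
        for col in spec.columns
        if col.is_pii and not col.encrypted
    ]


GLOBAL_POLICIES: List[Policy] = [must_have_owner, pii_must_be_encrypted]


def check(spec: ProductSpec, policies: List[Policy] = GLOBAL_POLICIES) -> List[str]:
    # Run every policy and collect all violations instead of stopping at the first.
    violations: List[str] = []
    for policy in policies:
        violations.extend(policy(spec))
    return violations


spec = ProductSpec(
    name="customers",
    owner_team="crm",
    columns=(Column("customer_id"), Column("email", is_pii=True)),
)
print(check(spec))  # ['customers.email: PII column is not encrypted']
```

The policies know nothing about the domain itself. They only inspect metadata, which is what allows a central platform to enforce them without taking over domain responsibility.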
Part 2: The Data Mesh Journey
A truly data-driven organization is based on decentralized data ownership. Adapting an organization to the data mesh paradigm takes time and is a journey in three steps:
- Mindset shift towards data mesh culture
- Platform support to scale
- Federated governance
Getting Started: A Data Product–Centered Mindset Shift
It is a misconception that change can be forced by technology. Organizations sit on top of large amounts of data, which are poorly documented and lack data ownership.
The chapter presents a case study that illustrates what a lack of data ownership means in practice. The case study makes the problems of unclear data ownership tangible. The section should be read in the book for more details.
Data producers tend to focus on operational services. With awareness of how the data is used for analytical purposes, and with early communication of changes in the operational domain, incidents in the analytical domain can be prevented. By creating incentives for the upstream dependency in the data flow, long-term collaboration and success can be established. This can happen through social incentivization, e.g., by mentioning the upstream work as important when announcing new features. It can also happen through material incentivization, e.g., by generating value from the downstream data and thereby increasing the team-level budget of the upstream dependency, so that it can invest more resources in fulfilling the downstream data requirements.
Getting started with data mesh is primarily a mindset change and not a technology or tool change. Such an organizational change should start small. In a "fail fast" setting, an MVP of a first data product should be built. Consumers of this data product should also be involved. It is important to define the minimum requirements very clearly.
The infrastructure platform should not be built before the first data product, but together with it. To reduce the risk of platform overengineering, only the things that are really required should be built. The infrastructure might be built completely greenfield or by adapting existing infrastructure.
Scaling the Mesh: Self-Serve Data Infrastructure
Data infrastructure teams are overloaded with central responsibility. The central team needs domain knowledge for decision making, and manual centralized processes don't scale. With a case study on centralized compute capabilities, the authors exemplify the problems of teams with central responsibility. This case study should be read directly in the book to understand it better. Central infrastructure with decentralized responsibility would prevent the overloading of the central team.
The data infrastructure for the data mesh has many capabilities, e.g.:
- Storage
- Compute
- CI/CD
- Data catalog (discovery)
- Access control management
- Data transformation blueprints
- Data product blueprints
- Computational governance
The data platform should be built on open standards to allow for interoperability.
Sustaining the Mesh: Federated Computational Data Governance
Federated data governance is important to prevent two data products from being non-interoperable, e.g., because they use different identifiers for the same domain object. Data mesh has no central source of truth, but multiple contextualized versions of truth. Things with the same name (e.g., customer) need not have the same meaning in each context. Cross-domain mappings and the identification of polysemes must be handled by a federated governance group. This should not end in a centralized enterprise model, but rather in a prototype-based approach at a contextual level (maybe with short-lived cross-domain working groups).
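A cross-domain mapping for a polyseme such as "customer" can be as small as a lookup table that links each domain's local identifier to a shared key. The sketch below assumes two hypothetical domains, "sales" and "support", with made-up identifiers. It shows the idea only and is not a recommendation for how such a mapping should be stored.

```python
from typing import Dict, Optional, Tuple

# (domain, local identifier) -> shared cross-domain key.
# All domains, identifiers and keys here are invented for illustration.
CUSTOMER_ID_MAP: Dict[Tuple[str, str], str] = {
    ("sales", "C-1001"): "cust-1",
    ("support", "4711"): "cust-1",
    ("sales", "C-1002"): "cust-2",
}


def to_shared_key(domain: str, local_id: str) -> Optional[str]:
    # Translate a domain-local identifier; None means no mapping is known.
    return CUSTOMER_ID_MAP.get((domain, local_id))


def same_entity(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    # Two local identifiers refer to the same object only if both are mapped
    # and resolve to the same shared key.
    key_a = to_shared_key(*a)
    key_b = to_shared_key(*b)
    return key_a is not None and key_a == key_b


print(same_entity(("sales", "C-1001"), ("support", "4711")))  # True
print(same_entity(("sales", "C-1002"), ("support", "4711")))  # False
```

Each domain keeps its own identifiers and its own meaning of "customer". Only the mapping is agreed on across domains, which avoids a single centralized enterprise model.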
Global data governance rules, e.g., in the context of GDPR, are necessary and must be followed. The self-service platform should provide a service to enforce data governance rules (e.g., the deletion of a user account in all data products or the encryption of PII data).
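The deletion of a user account across all data products could, for example, be offered as a platform service that fans a request out to handlers registered by each data product. The following is a minimal in-memory sketch under that assumption. `ErasureService`, `make_handler` and the sample rows are hypothetical and do not describe a real platform API.

```python
from typing import Callable, Dict, List

# A handler deletes all records of a user and returns how many were removed.
DeletionHandler = Callable[[str], int]


class ErasureService:
    """Hypothetical platform service; it knows handlers, not domain data."""

    def __init__(self) -> None:
        self._handlers: Dict[str, DeletionHandler] = {}

    def register(self, product_name: str, handler: DeletionHandler) -> None:
        # Each domain team registers the handler for its own data product.
        self._handlers[product_name] = handler

    def erase_user(self, user_id: str) -> Dict[str, int]:
        # Fan the request out to every data product and report deletions per product.
        return {name: handler(user_id) for name, handler in self._handlers.items()}


def make_handler(rows: List[dict]) -> DeletionHandler:
    # Helper for this example: builds a handler over an in-memory list of rows.
    def handler(user_id: str) -> int:
        before = len(rows)
        rows[:] = [row for row in rows if row["user_id"] != user_id]
        return before - len(rows)
    return handler


orders = [{"user_id": "u1", "total": 10}, {"user_id": "u2", "total": 5}]
clicks = [{"user_id": "u1", "page": "/home"}, {"user_id": "u1", "page": "/cart"}]

service = ErasureService()
service.register("orders", make_handler(orders))
service.register("clicks", make_handler(clicks))

report = service.erase_user("u1")
print(report)  # {'orders': 1, 'clicks': 2}
```

The service only coordinates the request. How records are actually removed stays inside each data product, so responsibility for domain data remains with the domain teams.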
Industry Practices
Common Pitfalls
- Overloading your people: Capacity has to be available to pursue the data mesh journey.
- Creating a Platform with Central Data Responsibility: The platform should be domain-agnostic and carry no central responsibility for domain data. As an example, the GDPR right to be forgotten should be implemented in a data-agnostic way.
- Building the Perfect Platform Up Front: Platform development should be iterative and driven by the needs of stakeholders, not a waterfall design.
- Misunderstanding the Data Mesh Concept: Data mesh is a journey that looks different in each company.
Best Practices
- Start Small, but with Commitment: Select a meaningful use case with impact. Make the first data product a success that demonstrates the concept and allows for learning.
- Define Your Domains Following Your Business Capabilities: Domains are important in the data mesh setup.
- Evangelize Data Mesh: Foster continuous exchange between the data mesh practitioners.
- Apply Product Thinking to Platform Development: Start with an MVP and follow up with a prioritized product backlog.