Data Products: It’s not what you call them that matters. It’s what you do with them
The term “data product” was coined by DJ Patil — former Chief Data
Scientist of the United States Office of Science and Technology Policy —
in his 2012 book “Data Jujitsu: The Art of Turning Data into Product”.
Since that time we’ve seen some evolution in how the term is used, by
whom, and in what contexts.
There are some who regard “data product” purely as a marketing term used
by “ambitious” tech vendors to elevate the importance of the products they
are trying to sell. Some of this cynicism is warranted.
At Astronomer, we see data and platform engineering teams using the term
to communicate with the business when describing how data is being
processed and applied. The analogy with physical products and supply
chains maps naturally to data products and data pipelines, enabling
engineers to communicate with more clarity to their stakeholders in
non-technical roles.
What’s especially interesting is that members of those same data and
platform engineering teams tend to use different terms when communicating
with each other. In this context they often prefer talking about things
like schema, tables, data models, datasets or data assets. Because the
definition of a data product can be ambiguous, using more precise
terminology makes a lot of sense.
How we think about data products at Astronomer
We see many data teams define a data product as a reusable data asset that
bundles together everything needed to make it independently usable by
authorized consumers. To us at Astronomer, data products are
pipeline-driven assets that capture the entire lifecycle of data within
DAGs, tasks, and datasets.
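To make that concrete, here is a minimal sketch of what a pipeline-driven
data product can look like in Airflow: a DAG whose tasks publish a named
Dataset that downstream consumers can depend on. The DAG, task, and asset
names (and the S3 URI) are illustrative only, not part of any real
pipeline.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical asset URI standing in for the data product's storage location.
orders_product = Dataset("s3://analytics/daily_orders.parquet")


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_orders_product():
    @task
    def extract():
        # Pull raw order records from the source system (stubbed here).
        return [{"order_id": 1, "amount": 42.0}]

    @task(outlets=[orders_product])
    def transform_and_publish(rows):
        # Clean, aggregate, and write the asset that consumers depend on.
        print(f"publishing {len(rows)} rows to {orders_product.uri}")

    transform_and_publish(extract())


daily_orders_product()
```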
If you want to see how some of the most forward-looking data teams are
powering data products with Apache Airflow and Astro, take a look at the
Foursquare Data Platform: From Fragmentation to Control (Plane) blog post
from the company’s CTO.
Clear ownership is a key characteristic of a data product. Ownership
ensures accountability and stewardship over the product’s quality,
usability, and evolution, aligning it with business goals and user needs
while maintaining data integrity, timeliness, and compliance. Clear
ownership also facilitates effective decision-making and prioritization,
ensuring the product remains valuable and relevant over time.
From buzzword to backbone: Data products are running the show
Whatever your preferred terminology and definition — we’ll stick with data
products for the rest of this post — they are becoming more critical to
every business. That’s because they are powering everything from analytics
and AI (both the analytical and generative varieties) to data-driven
software that delivers insights and actions within live applications.
Think retail recommendations with dynamic pricing, automated customer
support, predictive churn scores, financial trading strategies, regulatory
reporting, and more.
There are many reasons why enterprises adopt data products. Key drivers
include improved reliability and trust in data, composability and
reusability, democratized data development and usage, faster innovation
with agility and adaptability, closer alignment to the business, and
heightened security and governance, all underpinned by lower cost and
risk.
Sounds great! What could possibly go wrong?
As all Chief Data Officers and Data Engineering leaders know, while the
timely and reliable delivery of every product recommendation, dashboard,
or fine-tuned AI model looks easy, the reality is very different. This is
because data products rely on a complex web of intricate and often opaque
interactions and dependencies between an entire ecosystem of software,
systems, tools, and teams from engineering and the business.
Much like the manufacturing supply chains we talked about earlier that
take raw materials as an input and deliver finished products to customers
as an output, there is huge complexity in reliably delivering data
products, with a lot that can go wrong in the production process. Any
failure can have a direct impact on revenue and customer satisfaction,
decrease employee productivity, and, in extreme cases, leave an
organization open to regulatory sanction.
Figure 1: A small sample of the dependencies across the data supply
chain
Data stacks gone wild: Juggling tools, pipelines, and sanity
A major part of the challenge facing data and platform engineering teams
is the complexity of the modern data stack they rely on to build data
products. Teams are weighed down by:
- The proliferation of specialized tools: Each is designed to handle a
  specific part of the data supply chain (e.g., ingestion, transformation,
  storage, BI, MLOps, GenAI RAG, QA, governance, etc.). As organizations
  adopt more tools, managing and integrating each of them quickly becomes
  overwhelming.
- Integration overhead: Ensuring that all these tools work together
  seamlessly often requires significant custom engineering effort. Each
  tool has its own configurations, API, and interface, which adds to the
  complexity of the stack.
- Pipeline fragility: Data pipelines can be fragile, with minor changes in
  one stage or component potentially leading to failures in others. This
  fragility can result in high operational overhead as teams spend
  significant time troubleshooting and maintaining these pipelines.
- Cost inefficiencies: Organizations often end up paying for features or
  services they don’t fully utilize due to the proliferation of tools in
  the stack. Optimizing cost efficiency while maintaining performance and
  quality is a significant challenge.
Orchestration is crucial…but it’s only part of the solution
Today, delivering a data product reliably, on time and every time, demands
orchestration. This is because orchestration coordinates and manages the
interactions and dependencies between the source data, all of the tools
responsible for touching it at each stage of the data supply chain, the
underlying compute resources, and the different engineering teams that are
responsible for building the data product.
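As a rough illustration of that coordination, the sketch below schedules a
downstream reporting pipeline off the Dataset from the earlier sketch, so
the warehouse load, the transformation tool, and the BI refresh run in the
right order without manual hand-offs. The task IDs and shell commands are
placeholders standing in for real tool integrations.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

# Same illustrative asset as in the earlier sketch.
orders_product = Dataset("s3://analytics/daily_orders.parquet")


@dag(schedule=[orders_product], start_date=datetime(2024, 1, 1), catchup=False)
def orders_reporting():
    # Each operator stands in for a different tool in the data supply chain.
    load = BashOperator(task_id="load_warehouse", bash_command="echo 'COPY INTO ...'")
    transform = BashOperator(task_id="run_transformations", bash_command="echo 'dbt run'")
    refresh = BashOperator(task_id="refresh_dashboards", bash_command="echo 'refresh BI extracts'")

    # Explicit dependencies replace manual hand-offs between teams and tools.
    load >> transform >> refresh


orders_reporting()
```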
However, many orchestration tools offer only limited monitoring and
alerting to observe the data product as it traverses the data supply
chain. It’s not enough to just detect that a database schema has changed,
that one task within a pipeline is running at high latency, or that a
compute instance has failed. These may initially appear as isolated
incidents that can be quickly remediated; however, one delay often starts
a cascade of errors and failures further down the pipeline, which quickly
overwhelms both systems and people. These issues can compromise the very
reliability and accuracy of the data product itself.
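For context, this is roughly what that built-in, task-level monitoring
looks like in Airflow 2: a failure callback plus a per-task SLA. It
catches an individual failed or late task, but says nothing about the
health of the end data product those tasks feed. The notify() helper is a
hypothetical stand-in for a Slack or PagerDuty hook.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


def notify(context):
    # Placeholder alert hook; in practice this might post to Slack or PagerDuty.
    print(f"alerting on task {context['task_instance'].task_id}")


@dag(
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # Applied to every task in the DAG: alert on failure, flag runs over 30 minutes.
    default_args={"on_failure_callback": notify, "sla": timedelta(minutes=30)},
)
def scored_events():
    @task
    def score_events():
        # A single late or failed run here can cascade into every consumer.
        ...

    score_events()


scored_events()
```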
How about using a dedicated observability tool to monitor the pipeline?
For one, they add to the proliferation of tools data teams are already
trying to rein in. More importantly, current observability tools often
fall short when applied to data pipelines because they are primarily
designed for monitoring data warehouses rather than pipelines. They
struggle to effectively detect job failures or identify when jobs fail to
start on time, both of which are crucial for meeting SLAs. These
limitations make it difficult to quickly diagnose and resolve issues, and
that speed is essential for maintaining the integrity and reliability of
data pipelines.
Ultimately, the limitations in today’s orchestration and observability
tools mean platform and data engineers are powerless to prevent data
downtime and pipeline errors. They spend their time reacting to failures,
rather than proactively managing the data product. Issues are often not
detected until the data product is being used (or is missing), by which
time it’s too late.
This is like only detecting a manufacturing fault when a physical product
arrives at a distribution warehouse, or worse, when it’s in the customer’s
hands. Returns and scrappage drive incredible waste, lost sales, unhappy
customers and possible regulatory and safety consequences.
From monitoring to proactive recommendations: orchestration + observability
The way we develop and build data products needs to change.
What we need to do is unify orchestration with observability in a single
platform that provides an actionable view of the data product across
every stage of the supply chain, along with proactive recommendations and
predictive maintenance. This allows engineers to quickly zero in on
problem areas, which are often buried within complex workflows spread
across multiple data pipelines, teams, and deployments, and, more
importantly, to get ahead of problems before they result in a failure.
Owners of data products should have the ability to set thresholds around
delivery time and data freshness, and to receive forewarning when
bottlenecks within workflows risk contravening SLAs, or when the
staleness of the data risks compromising the quality or trustworthiness
of the end data product.
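As a hedged sketch of what such a threshold might look like in practice,
the check below fails a run when the asset’s age exceeds an agreed
freshness window. The get_last_updated() helper is hypothetical and would
normally query a catalog, warehouse, or object store.

```python
from datetime import datetime, timedelta, timezone

from airflow.decorators import dag, task

# Illustrative delivery-time threshold agreed with the data product's consumers.
FRESHNESS_SLA = timedelta(hours=2)


def get_last_updated(asset_uri: str) -> datetime:
    # Stub: in practice this would look up the asset's last update in a
    # catalog, warehouse, or object store.
    return datetime.now(timezone.utc) - timedelta(minutes=30)


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def orders_freshness_check():
    @task
    def assert_fresh():
        age = datetime.now(timezone.utc) - get_last_updated(
            "s3://analytics/daily_orders.parquet"
        )
        if age > FRESHNESS_SLA:
            # Failing the task surfaces staleness before consumers see bad data.
            raise ValueError(f"daily_orders is {age} old, exceeding the {FRESHNESS_SLA} SLA")

    assert_fresh()


orders_freshness_check()
```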
By meeting these requirements, our data teams can truly move the needle
by:
- Tying data pipelines directly to business outcomes. This ensures that
  every pipeline contributes clearly to strategic goals and delivers
  quantifiable value to the business.
- Improving the reliability and trust of data products.
- Lowering costs.
- Better securing and governing critical data assets.
- Unlocking engineering resources to work on more valuable and productive
  initiatives for the business.
Through our Astro platform, we are well advanced along the path to
unified orchestration and observability. Key capabilities include
sophisticated automation, monitoring, error management, alerting, and
comprehensive UI and API access, along with security and governance
controls spanning the data, workflow, and infrastructure planes of your
data pipelines.
This is a great foundation for you to build on, but a ton more is
coming that furthers the mission for data teams. We have some big
announcements planned for the Apache Airflow
Summit.
It’s not too late to get your ticket to the event. If you aren’t able to
make it live, keep an eye on the Astronomer
blog
and social handles. We think you’ll be excited!
If you want to explore what’s possible today, you can try Astro for
free.