Data Products: It’s not what you call them that matters. It’s what you do with them
The term “data product” was coined by DJ Patil — former Chief Data
Scientist of the United States Office of Science and Technology Policy —
in his 2012 book “Data Jujitsu: The Art of Turning Data into Product”.
Since that time we’ve seen some evolution in how the term is used, by
whom, and in what contexts.
There are some who regard “data product” purely as a marketing term used
by “ambitious” tech vendors to elevate the importance of the products they
are trying to sell. Some of this cynicism is warranted.
At Astronomer, we see data and platform engineering teams using the term
to communicate with the business when describing how data is being
processed and applied. The analogy with physical products and supply
chains maps naturally to data products and data pipelines, enabling
engineers to communicate with more clarity to their stakeholders in
non-technical roles.
What’s especially interesting is that members of those same data and
platform engineering teams tend to use different terms when communicating
with each other. In this context they often prefer talking about things
like schema, tables, data models, datasets or data assets. Because the
definition of a data product can be ambiguous, using more precise
terminology makes a lot of sense.
How we think about data products at Astronomer
We see many data teams define a data product as a reusable data asset that
bundles together everything needed to make it independently usable by
authorized consumers. To us at Astronomer, data products are
pipeline-driven assets that capture the entire lifecycle of data within
DAGs, tasks, and datasets.
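To make that concrete, here is a minimal sketch of what a pipeline-driven
data product can look like in Airflow: a DAG whose tasks publish a named
Dataset that downstream consumers can depend on. The DAG, task, and asset
names (and the S3 URI) are illustrative only, not part of any real
pipeline.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical asset URI standing in for the data product's storage location.
orders_product = Dataset("s3://analytics/daily_orders.parquet")


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_orders_product():
    @task
    def extract():
        # Pull raw order records from the source system (stubbed here).
        return [{"order_id": 1, "amount": 42.0}]

    @task(outlets=[orders_product])
    def transform_and_publish(rows):
        # Clean, aggregate, and write the asset that consumers depend on.
        print(f"publishing {len(rows)} rows to {orders_product.uri}")

    transform_and_publish(extract())


daily_orders_product()
```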
If you want to see how some of the most forward-looking data teams are
powering data products with Apache Airflow and Astro, take a look at the
Foursquare Data Platform: From Fragmentation to Control (Plane) blog post
from the company’s CTO.
Clear ownership is a key characteristic of a data product. Ownership
ensures accountability and stewardship over the product’s quality,
usability, and evolution, aligning it with business goals and user needs
while maintaining data integrity, timeliness, and compliance. Clear
ownership also facilitates effective decision-making and prioritization,
ensuring the product remains valuable and relevant over time.
From buzzword to backbone: Data products are running the show
Whatever your preferred terminology and definition — we’ll stick with data
products for the rest of this post — they are becoming more critical to
every business. That’s because they are powering everything from analytics
and AI (both the analytical and generative varieties) to data-driven
software that delivers insights and actions within live applications.
Think retail recommendations with dynamic pricing, automated customer
support, predictive churn scores, financial trading strategies, regulatory
reporting, and more.
There are many reasons why enterprises adopt data products. Key drivers
include improved reliability and trust in data, composability and
reusability, democratized data development and usage, faster innovation
with agility and adaptability, closer alignment to the business, and
heightened security and governance, all underpinned by lower cost and
risk.
Sounds great! What could possibly go wrong?
As all Chief Data Officers and Data Engineering leaders know, while the
timely and reliable delivery of every product recommendation, dashboard,
or fine-tuned AI model looks easy, the reality is very different. This is
because data products rely on a complex web of intricate and often opaque
interactions and dependencies between an entire ecosystem of software,
systems, tools, and teams from engineering and the business.
Much like the manufacturing supply chains we talked about earlier that
take raw materials as an input and deliver finished products to customers
as an output, there is huge complexity in reliably delivering data
products, with a lot that can go wrong in the production process. Any
failure can have a direct impact on revenue and customer satisfaction,
decrease employee productivity, and, in extreme cases, leave an
organization open to regulatory sanction.
Figure 1: A small sample of the dependencies across the data supply
chain
Data stacks gone wild: Juggling tools, pipelines, and sanity
A major part of the challenge facing data and platform engineering teams
is the complexity of the modern data stack they rely on to build data
products. Teams are weighed down by:
- The proliferation of specialized tools: Each is designed to handle a
  specific part of the data supply chain (e.g., ingestion, transformation,
  storage, BI, MLOps, GenAI RAG, QA, governance, etc.). As organizations
  adopt more tools, managing and integrating each of them quickly becomes
  overwhelming.
- Integration overhead: Ensuring that all these tools work together
  seamlessly often requires significant custom engineering effort. Each
  tool has its own configurations, API, and interface, which adds to the
  complexity of the stack.
- Pipeline fragility: Data pipelines can be fragile, with minor changes in
  one stage or component potentially leading to failures in others. This
  fragility can result in high operational overhead as teams spend
  significant time troubleshooting and maintaining these pipelines.
- Cost inefficiencies: Organizations often end up paying for features or
  services they don’t fully utilize due to the proliferation of tools in
  the stack. Optimizing cost efficiency while maintaining performance and
  quality is a significant challenge.
Orchestration is crucial…but it’s only part of the solution
Today, delivering a data product reliably, on time and every time, demands
orchestration. This is because orchestration coordinates and manages the
interactions and dependencies between the source data, all of the tools
responsible for touching it at each stage of the data supply chain, the
underlying compute resources, and the different engineering teams that are
responsible for building the data product.
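As a rough illustration of that coordination, the sketch below schedules a
downstream reporting pipeline off the Dataset from the earlier sketch, so
the warehouse load, the transformation tool, and the BI refresh run in the
right order without manual hand-offs. The task IDs and shell commands are
placeholders standing in for real tool integrations.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag
from airflow.operators.bash import BashOperator

# Same illustrative asset as in the earlier sketch.
orders_product = Dataset("s3://analytics/daily_orders.parquet")


@dag(schedule=[orders_product], start_date=datetime(2024, 1, 1), catchup=False)
def orders_reporting():
    # Each operator stands in for a different tool in the data supply chain.
    load = BashOperator(task_id="load_warehouse", bash_command="echo 'COPY INTO ...'")
    transform = BashOperator(task_id="run_transformations", bash_command="echo 'dbt run'")
    refresh = BashOperator(task_id="refresh_dashboards", bash_command="echo 'refresh BI extracts'")

    # Explicit dependencies replace manual hand-offs between teams and tools.
    load >> transform >> refresh


orders_reporting()
```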
However, many orchestration tools offer only limited monitoring and
alerting to observe the data product as it traverses the data supply
chain. It’s not enough to just detect that a database schema has changed,
that one task within a pipeline is running at high latency, or that a
compute instance has failed. These may initially appear as isolated
incidents that can be quickly remediated; however, one delay often starts
a cascade of errors and failures further down the pipeline, which quickly
overwhelms both systems and people. These issues can compromise the very
reliability and accuracy of the data product itself.
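For context, this is roughly what that built-in, task-level monitoring
looks like in Airflow 2: a failure callback plus a per-task SLA. It
catches an individual failed or late task, but says nothing about the
health of the end data product those tasks feed. The notify() helper is a
hypothetical stand-in for a Slack or PagerDuty hook.

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


def notify(context):
    # Placeholder alert hook; in practice this might post to Slack or PagerDuty.
    print(f"alerting on task {context['task_instance'].task_id}")


@dag(
    schedule="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # Applied to every task in the DAG: alert on failure, flag runs over 30 minutes.
    default_args={"on_failure_callback": notify, "sla": timedelta(minutes=30)},
)
def scored_events():
    @task
    def score_events():
        # A single late or failed run here can cascade into every consumer.
        ...

    score_events()


scored_events()
```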
How about using a dedicated observability tool to monitor the pipeline?
For one, they add to the proliferation of tools data teams are already
trying to rein in. More importantly, current observability tools often
fall short when applied to data pipelines because they are primarily
designed for monitoring data warehouses rather than pipelines. They
struggle to effectively detect job failures or identify when jobs fail to
start on time, both of which are crucial for meeting SLAs. These
limitations make it difficult to quickly diagnose and resolve issues, and
that speed is essential for maintaining the integrity and reliability of
data pipelines.
Ultimately, the limitations in today’s orchestration and observability
tools mean platform and data engineers are powerless to prevent data
downtime and pipeline errors. They spend their time reacting to failures,
rather than proactively managing the data product. Issues are often not
detected until the data product is being used (or is missing), by which
time it’s too late.
This is like only detecting a manufacturing fault when a physical product
arrives at a distribution warehouse, or worse, when it’s in the customer’s
hands. Returns and scrappage drive incredible waste, lost sales, unhappy
customers and possible regulatory and safety consequences.
From monitoring to proactive recommendations: orchestration + observability
The way we develop and build data products needs to change.
What we need to do is unify orchestration with observability in a single
platform that provides an actionable view of the data product across
every stage of the supply chain, along with proactive recommendations and
predictive maintenance. This allows engineers to quickly zero in on
problem areas, which are often buried within complex workflows spread
across multiple data pipelines, teams, and deployments, and, more
importantly, to get ahead of problems before they result in a failure.
Owners of data products should have the ability to set thresholds around
delivery time and data freshness, and to receive forewarning when
bottlenecks within workflows risk contravening SLAs, or when the
staleness of the data risks compromising the quality or trustworthiness
of the end data product.
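As a hedged sketch of what such a threshold might look like in practice,
the check below fails a run when the asset’s age exceeds an agreed
freshness window. The get_last_updated() helper is hypothetical and would
normally query a catalog, warehouse, or object store.

```python
from datetime import datetime, timedelta, timezone

from airflow.decorators import dag, task

# Illustrative delivery-time threshold agreed with the data product's consumers.
FRESHNESS_SLA = timedelta(hours=2)


def get_last_updated(asset_uri: str) -> datetime:
    # Stub: in practice this would look up the asset's last update in a
    # catalog, warehouse, or object store.
    return datetime.now(timezone.utc) - timedelta(minutes=30)


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def orders_freshness_check():
    @task
    def assert_fresh():
        age = datetime.now(timezone.utc) - get_last_updated(
            "s3://analytics/daily_orders.parquet"
        )
        if age > FRESHNESS_SLA:
            # Failing the task surfaces staleness before consumers see bad data.
            raise ValueError(f"daily_orders is {age} old, exceeding the {FRESHNESS_SLA} SLA")

    assert_fresh()


orders_freshness_check()
```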
By meeting these requirements, our data teams can truly move the needle
by:
- Tying data pipelines directly to business outcomes. This ensures that
  every pipeline contributes clearly to strategic goals and delivers
  quantifiable value to the business.
- Improving the reliability and trust of data products.
- Lowering costs.
- Better securing and governing critical data assets.
- Unlocking engineering resources to work on more valuable and productive
  initiatives for the business.
Through our Astro platform, we are well advanced along the path to
unified orchestration and observability. Key capabilities include
sophisticated automation, monitoring, error management, alerting, and
comprehensive UI and API access, along with security and governance
controls spanning the data, workflow, and infrastructure planes of your
data pipelines.
This is a great foundation for you to build on, but a ton more is
coming that furthers the mission for data teams. We have some big
announcements planned for the Apache Airflow
Summit.
It’s not too late to get your ticket to the event. If you aren’t able to
make it live, keep an eye on the Astronomer
blog
and social handles. We think you’ll be excited!
If you want to explore what’s possible today, you can try Astro for
free.