Leveraging data products for health and performance benefits
A data product is a pipeline-driven asset that captures the data lifecycle of a pipeline. Tasks, datasets, warehouse tables, and local files can all be assets of data products. Data products are abstractions that serve the purpose of gaining observability into the health and performance of data pipelines.
See also:
When to define a data product
Not all tasks and datasets are critical to business needs or internal teams, so it pays to be deliberate when creating data products.
Consider creating a data product when the end result of one or more pipelines:
- Is crucial for business operations, decision-making, or compliance.
- Is depended on by multiple teams or external partners.
- Involves complex processes or multiple sources.
- Is subject to regulatory requirements (GDPR, HIPAA).
- Contains or touches sensitive data such as customer data or PII.
- Is required to be available at a particular time.
Examples
Defining a data product makes particular sense when business-critical data is involved, multiple teams touch the pipeline, and on-time delivery and reliability of the data are of primary importance.
For example, you might want to use a data product if your organization manages regulatory data coming from different departments, produced by multiple teams, and the data in feeds a crucial compliance report.
Creating a data product for these critical assets enables visualization and monitoring of the flow of data through the distributed tasks that generate and modify the tables.
In Observe, you see a lineage graph that visualizes the path of the report's data from all sources through the DAGs that extracted, transformed, and loaded the data into the data product.
Each node represents an asset with a unique identifier, the emitting system (Apache Airflow, Snowflake), and the length of time since the asset was last observed.
Expanding a nested DAG node reveals connected task nodes, each with a unique identifier, the containing DAG, the task status, the run duration, and the time since the task was last observed.
A key benefit of a lineage graph is how easy it makes identifying upstream assets, offering visibility into the tasks that could delay delivery or compromise the quality of business-critical data if they failed, along with the owners of those tasks.
Metadata collection enabled by data products also unlocks analytics, monitoring, and alerting, including proactive alerting using SLAs. For example, on Astro Observe, you could use the SLA success rate metric to identify performance issues before they broke a report like this.
Other assets for which you might want to define data products include:
- A table used by multiple teams to generate executive dashboards.
- A table containing customer data accessed by multiple teams in your organization.
Examples of when it probably does not make sense to define data products:
- For objects created by a single task with no upstream or downstream dependencies.
- For a table with non-critical or non-sensitive data in a sandbox that is used only occasionally.
Leveraging data products on Astro
To learn more about how to create and manage data products, manage alerts, and evaluate pipeline health on Astro, see the Astro Observe documentation.