Data Observability 101: An Introduction to the Most Critical Features of Modern Data Observability
When data is your product, your pipelines are supply chains.
An irony of the current demand for data observability solutions is that the data world is finally catching up to the physical world.
Companies selling everything from lunchboxes to water beds have long enjoyed deep visibility into their supply chains. Supply chain management as a discipline goes back to the 1980s. The practice of applying analytics to supply chains is much older.
A company selling a lunchbox can expect to know its status and owner at every stage of its journey, practically from ideation to the day a child carries it to school – through manufacturing, shipping, warehousing, merchandising, and point of sale.
Data companies can now expect the same depth of visibility into their data pipelines, including predictive monitoring and analytics.
Whether a company sells geospatial datasets or recommendations for improving returns on investment income or an online gambling platform as a service, data is what’s for sale – or at least part of it.
Even if a company uses data internally to make critical decisions instead of offering it in product form directly to customers, the data is business-critical. For too long, companies have had to rely on outdated monitoring that’s better suited to warehousing than orchestrating.
If you can’t reliably detect job failures or stale data, you can’t diagnose and resolve issues – much less get ahead of them.
This is why data observability is, for many companies, no longer optional or a nice-to-have. This is also why our approach to observability at Astronomer is centered on data products: a way of thinking about data-driven products and the assets in them as tied to business outcomes.
Data observability: the three pillars
Data observability is the ability to understand the health of a data product and its state across a fragmented data ecosystem, and it can be broken down into three basic functions or purposes: analytics, monitoring, and alerting.
Analytics
Analytics concerns the performance, status, and relationships between dependencies in a data product, such as tasks and datasets. Analytics can do more than just “report the news” about a pipeline. One example is tracking task run performance over time for predictive scoring: the rate at which a task’s run duration has fallen within one standard deviation of its historical runs can be used to predict how likely the data product is to meet expectations for cost and reliability in the future.
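To make this concrete, here is a minimal Python sketch of that kind of scoring, assuming a hypothetical list of historical run durations for a single task (the function name, the history, and the one-standard-deviation band are illustrative, not taken from any particular tool):

```python
# A minimal sketch of predictive scoring from run-duration history.
# The durations below are hypothetical values in seconds.
from statistics import mean, stdev

def within_one_stddev_rate(durations: list[float]) -> float:
    """Fraction of runs whose duration fell within one standard deviation
    of the task's mean run duration."""
    if len(durations) < 2:
        return 1.0  # not enough history to score against
    mu, sigma = mean(durations), stdev(durations)
    in_band = [d for d in durations if abs(d - mu) <= sigma]
    return len(in_band) / len(durations)

# A task that usually runs for about two minutes, with one slow outlier.
history = [118.0, 122.5, 119.8, 121.1, 240.0, 120.3]
print(f"Reliability score: {within_one_stddev_rate(history):.0%}")
```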
Monitoring
Monitoring concerns data quality checks, task and pipeline run duration tracking, run status checking, and service-level agreements (SLAs) for tracking data freshness and timeliness. Generally speaking, SLAs are contracts that specify acceptable delivery of services (platform uptime, for example) and compensation owed in the case of violations (platform usage credits, for example). In the context of data observability, SLAs are expected thresholds for timely delivery or freshness of data. These can be used to measure and track performance as well as trigger alerts.
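As a simple illustration of a freshness SLA, here is a minimal Python sketch assuming a hypothetical six-hour threshold and a last-updated timestamp read from a dataset’s metadata (the threshold, timestamp, and helper name are all illustrative):

```python
# A minimal sketch of a data freshness check against an SLA threshold.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=6)  # hypothetical: data must be no older than 6 hours

def is_fresh(last_updated: datetime, now: datetime | None = None) -> bool:
    """Return True if the dataset was updated within the SLA window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= FRESHNESS_SLA

# Hypothetical timestamp pulled from a warehouse table's metadata.
last_updated = datetime(2024, 11, 18, 9, 30, tzinfo=timezone.utc)
if not is_fresh(last_updated):
    print("Freshness SLA violated: notify the data product owner.")
```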
Alerting
Alerting extends beyond typical, reactive alerting – on task run failure, for example – to encompass proactive alerting. SLAs unlock multiple levels of notifications, so a platform can notify pipeline owners not only when a run duration or a period of time between runs violates an SLA but also when an in-process run duration is within a specified range of missing an SLA. With analytics and monitoring in place, a robust observability solution can also alert a data product owner to anomalies – for example, a task run duration beyond the standard deviation of that task’s past runs.
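Here is a minimal Python sketch of that proactive logic, assuming a hypothetical 30-minute run-duration SLA and a five-minute warning margin (both values, and the function name, are illustrative):

```python
# A minimal sketch of proactive alerting on an in-progress run.
from datetime import datetime, timedelta, timezone

RUN_DURATION_SLA = timedelta(minutes=30)  # hypothetical SLA for total run duration
WARNING_MARGIN = timedelta(minutes=5)     # warn when within 5 minutes of the SLA

def check_in_progress_run(started_at: datetime, now: datetime | None = None) -> str:
    """Classify an in-progress run as ok, at risk, or in violation of its SLA."""
    elapsed = (now or datetime.now(timezone.utc)) - started_at
    if elapsed > RUN_DURATION_SLA:
        return "violated"  # reactive alert: the SLA has already been missed
    if elapsed > RUN_DURATION_SLA - WARNING_MARGIN:
        return "at_risk"   # proactive alert: notify before the miss happens
    return "ok"

# A hypothetical run that started 27 minutes ago: still running, but at risk.
status = check_in_progress_run(datetime.now(timezone.utc) - timedelta(minutes=27))
print(f"In-progress run status: {status}")
```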
Data observability: why it matters
Notifications on task failure alone are sometimes of little use.
If all a data engineer can expect in the way of observability is reactive alerts on task failures – getting an email notification when something has already broken – this does not fully protect an organization’s critical data supply chains.
Teams responsible for designing and maintaining a critical supply chain need to be empowered and supported by data-driven insights into health and performance. Reactive alerts are essential, of course, but on their own they do not offer sufficient protection for critical pipelines, enable teams to perform root cause analysis (RCA) on data products quickly, or prevent failures.
To be fully empowered to protect critical assets, engineers need insight into who owns what up and down their supply chains and across their organizations because critical data products are often fed by assets owned by multiple teams. They also need historical and predictive analytics for insights into the health of those assets: a single pane of glass for everything that feeds a data product.
Teams should be notified if a pipeline is in danger of failing or data for a critical dashboard is in danger of becoming stale, and they should have the ability to identify and address the root cause quickly. This requires, in addition to analytics and monitoring, that an engineer have data lineage tracking in an observability solution – tracking that is ideally enriched with run status and run duration information and other metadata that can help data product owners in the organization debug an issue on their own as much as possible, even if the root cause is in another team’s pipeline.
Why it helps to have data lineage tracking
Data lineage helps you see the big picture of your data, showing where it comes from, how it changes, who owns it at each stage, and where it goes. This clarity makes it easier to trust your data. By tracking data at different levels across various data assets – like warehouse tables, data lake buckets, and Airflow tasks – lineage tracking helps you analyze impacts and fix problems quickly. Knowing how one part of your data affects others lets you make smarter decisions and avoid unexpected issues. This transparency builds trust in the accuracy of data for specific use cases and equips you to make well-informed decisions before committing changes and avoid unwanted changes downstream.
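To see why this matters in practice, here is a minimal Python sketch of upstream impact analysis over a hypothetical in-memory lineage graph (the asset names and owners are made up, and real lineage tooling such as OpenLineage captures this metadata automatically rather than by hand):

```python
# A minimal sketch of upstream impact analysis over a lineage graph.
# Each asset maps to the assets it reads from; owners are hypothetical teams.
upstreams = {
    "dashboard.revenue": ["warehouse.orders_daily"],
    "warehouse.orders_daily": ["lake.raw_orders", "warehouse.fx_rates"],
    "lake.raw_orders": [],
    "warehouse.fx_rates": [],
}
owners = {
    "dashboard.revenue": "analytics-team",
    "warehouse.orders_daily": "data-platform-team",
    "lake.raw_orders": "ingestion-team",
    "warehouse.fx_rates": "finance-data-team",
}

def upstream_assets(asset: str) -> set[str]:
    """Walk the lineage graph to find everything a given asset depends on."""
    found: set[str] = set()
    stack = list(upstreams.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(upstreams.get(current, []))
    return found

# If the revenue dashboard goes stale, lineage shows what to check and whom to ask.
for dep in sorted(upstream_assets("dashboard.revenue")):
    print(f"{dep} (owned by {owners[dep]})")
```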
Introducing Astro Observe (public preview)
You can now leverage data products for unified analytics, monitoring, and alerting in a new observability product called Astro Observe (public preview).
Analytics in Astro Observe
The unified data product dashboard provides historical data about the performance of data products against expectations for timeliness and freshness, plus predictive scoring, so you have actionable information about the health of assets at a glance. The dashboard helps you identify tasks that are inefficient or compromising critical business assets before they break something outright. Under the hood, the analytics in Astro Observe leverage SLAs (more on these below).
A data product dashboard in Astro Observe. The dashboard displays color-coded predictive scoring and historical analytics based on performance of the data product against expectations for timeliness and freshness.
Astro Observe’s enriched data lineage Graph visualizes upstream and downstream dependencies across all assets in a data product, with run status and run duration information on task nodes. A detail drawer provides warnings, recommendations, and a link to Airflow for debugging. You can use the Graph to learn about the path that data in your organization takes across teams, including different deployments, before it reaches its final destination. Without this data lineage tracking, it is harder to know about all the pipelines that touch critical data in an organization, as well as whom to reach out to when upstream tasks fail or delay delivery.
A lineage graph in Astro Observe displaying run duration and run status information on Airflow task nodes.
Monitoring in Astro Observe
You can use SLAs to specify expectations for the timely delivery and freshness of data products. Used in combination with alerting, SLAs powerfully enhance Airflow’s monitoring feature set. In Astro Observe, you can use SLAs to monitor pipelines for late delivery of data feeding a critical report, or for the freshness of data in a critical dashboard – a capability that open-source Airflow does not offer by itself. (Well, technically, you can try to use Airflow SLAs, but we don’t recommend it.) SLAs also unlock the analytics insights in the product, so you can see trends in the performance of assets at a glance.
The interface for creating an SLA in Astro Observe.
Astro Observe features automated anomaly detection that warns you about task runs with durations outside the standard deviation of a task’s prior runs, so you can identify and debug flaky tasks before they delay or break your pipelines.
An anomaly detection warning in Astro Observe.
Alerting in Astro Observe
You can set up alerts that notify you when a data product fails an SLA evaluation. Notifications include links to assets for quick debugging. You can also set up proactive alerts so you’re notified when an asset is at risk of missing a timeliness or freshness SLA.
The interface for creating a proactive alert in Astro Observe.
Alert notifications from the platform include links to all assets evaluated and their statuses, so you can quickly find and debug the tasks that caused the delay or failure.
Assets evaluated by an SLA, allowing you to zero in on the assets that caused the pipeline to be delayed or fail.
Getting started on Astro Observe
Astro Observe empowers teams by giving them deep insight into the supply chains that produce their data products. You can create a data product from all the disparate assets in your organization that feed a critical output, making it easy to track performance and fix issues as they arise, even across deployments.
Astro Observe is now in Public Preview and available to all customers. General availability is planned for early 2025.
If you’re interested in participating in the preview, you can sign up now, and we’ll be in touch to discuss your specific deployments and configuration.