Leverage SLAs for enhanced data quality monitoring
With Service Level Agreements (SLAs), you can use monitoring and proactive alerting to help ensure the timeliness and freshness of the data your Apache Airflow pipelines deliver.
This guide covers:
- Key concepts for understanding SLAs.
- Options for implementing SLAs on OSS Airflow and Astro Observe.
- Use cases for setting up proactive alerting, timeliness SLAs, and freshness SLAs on Astro Observe.
Assumed knowledge
To get the most out of this guide, you should have an understanding of:
- What Airflow is and when to use it. See: Introduction to Apache Airflow.
- How to start an Astro trial and deploy Airflow dags to the platform. See: Start your Astro trial.
What are SLAs?
Strictly speaking, SLAs are terms of service contracts, often between customers and service providers. They can contain a review period and terms for compensation if a violation occurs. More broadly speaking, SLAs are powerful tools for monitoring the performance of data pipelines. They can apply internally as well as externally, and they help organizations compare data quality indicators to agreed-upon targets.
SLAs depend on SLIs (Service Level Indicators) and SLOs (Service Level Objectives).
SLIs are quantifiable measures of data quality. Common SLIs of importance to both data teams and data consumers are data freshness and timeliness.
SLOs are performance targets for SLIs. For example, an organization could decide that a lateness rate of 0.30% or below is acceptable but that a rate above 1% is unacceptable, because lateness to this degree could have a material effect on business-critical teams or processes.
Having defined an SLI and SLO, the organization could then establish internal and external SLAs for data teams. Internally, they might commit to an on-time delivery SLO of 99.7%. In SLAs with external stakeholders, they might commit to an SLO of 99.0%.
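To make the relationship between an SLI and its SLOs concrete, the following sketch computes a lateness rate from hypothetical delivery counts and compares it with the internal (99.7% on time) and external (99.0% on time) targets from the example above. The function names and the counts are illustrative assumptions, not part of Airflow or Astro.

```python
def lateness_rate(late_deliveries: int, total_deliveries: int) -> float:
    """SLI: percentage of deliveries that arrived late."""
    if total_deliveries <= 0:
        raise ValueError("total_deliveries must be positive")
    return 100.0 * late_deliveries / total_deliveries


def meets_slo(sli_percent: float, max_late_percent: float) -> bool:
    """SLO check: the measured lateness must not exceed the agreed target."""
    return sli_percent <= max_late_percent


# Hypothetical month: 1,000 deliveries, 5 of them late (0.5% lateness).
sli = lateness_rate(late_deliveries=5, total_deliveries=1000)

internal_ok = meets_slo(sli, max_late_percent=0.3)  # internal target: 99.7% on time
external_ok = meets_slo(sli, max_late_percent=1.0)  # external target: 99.0% on time

print(f"lateness={sli:.2f}% internal_ok={internal_ok} external_ok={external_ok}")
```

In this made-up month the team would miss its stricter internal target while still meeting the external one, which is the early-warning margin that separate internal and external SLOs are meant to provide.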
In the context of data orchestration, an SLA is something different, but related. In this context, an SLA is a feature you can use to measure performance internally against arbitrary SLOs. In theory, you could use an SLA to evaluate the freshness or timeliness of any dataset across all the pipelines in your organization. In practice, teams use this type of SLA to monitor the performance of critical pipelines against internal SLOs.
Examples of critical pipelines for monitoring with SLAs include:
- A pipeline that provides data for a report that needs to be delivered by a certain time.
- A pipeline that provides data for a dashboard that needs to be consistently fresh.
Using Airflow and Astro, you can set up SLAs with proactive alerting and leverage a recommendation engine to catch problematic tasks in your pipeline before they cause you to breach a contracted SLA with a customer. You can also set up automatic monitoring of the freshness or timeliness of the data that teams deliver to customers as well as to internal teams. To take full advantage of the benefits of SLAs, teams implement them in conjunction with proactive alerts.
Astronomer recommends being selective in setting up SLAs. Too many SLAs can be no better than none, as alerts on critical assets can get overlooked. SLAs are best suited for use cases in which critical data must be either always fresh or consistently available on a regular cadence.
This guide offers details about SLAs that you can implement on Airflow and Astro, as well as pointers for identifying use cases for SLAs and choosing the right type of SLA for your use case.
For guidance on setting up SLAs on Astro, see:
Timeliness vs. freshness
Depending on your use case, you might want to use SLAs to monitor timeliness, freshness, or both:
- Timeliness SLA: if your team owns a pipeline that delivers data on a cadence, for example a dataset serving a critical monthly report, a timeliness SLA for monitoring on-time delivery would probably be more useful than a freshness SLA, especially if pipelines owned by multiple teams feed the dataset in question. In this case, you would probably expect the data to be "stale" until the pipeline ran each month, making freshness less critical than on-time delivery.
- Freshness SLA: if your team owns a pipeline that feeds a dataset that must be fresh at all times, for example a dataset for a critical dashboard, a freshness SLA would probably be more useful than a timeliness SLA.
- Timeliness and freshness SLAs: if your team owns a pipeline that feeds a dataset that must be both delivered on a cadence and kept fresh, for example a dataset that feeds both a critical dashboard and a regular report, both a timeliness SLA and a freshness SLA can be useful. In this case, you might want to set up an alert for staleness even if the delivery time met expectations, assuming the threshold for freshness is narrower than that for timeliness. For example, if the delivery cadence is monthly, freshness monitoring on a daily basis would not be overkill if the dashboard is used frequently.
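The difference between the two checks can be sketched in a few lines of plain Python. This is a simplified illustration rather than Astro or Airflow code: the helper names, dates, and thresholds are hypothetical. It shows a dataset that was delivered on time for its monthly deadline but no longer satisfies a one-day freshness threshold a few days later.

```python
from datetime import datetime, timedelta


def timeliness_met(delivered_at: datetime, deadline: datetime) -> bool:
    """Timeliness: the data was delivered at or before the agreed deadline."""
    return delivered_at <= deadline


def freshness_met(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the most recent update is no older than max_age."""
    return now - last_updated <= max_age


# Hypothetical monthly delivery that arrived an hour before its deadline.
delivered_at = datetime(2025, 3, 1, 8, 0)
deadline = datetime(2025, 3, 1, 9, 0)
on_time = timeliness_met(delivered_at, deadline)

# Three days later no further update has landed, but the dashboard expects daily data.
now = datetime(2025, 3, 4, 8, 0)
fresh = freshness_met(delivered_at, now, max_age=timedelta(days=1))

print(f"on_time={on_time} fresh={fresh}")
```

Because the two checks answer different questions (a fixed deadline versus the age of the latest update), one can pass while the other fails, which is why some datasets warrant both.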
SLAs on OSS Airflow
Airflow's built-in SLAs feature is designed to enable timeliness monitoring. Using an operator parameter, you can set a maximum time duration in which a task should be completed relative to the dag run start time. If a task takes longer than this to run, it should then be visible in the SLA Misses part of the user interface. You can configure Airflow to send you an email containing all tasks that missed their SLAs.
To set an SLA for a task, you pass a datetime.timedelta object to an operator's sla parameter. For more guidance, see: Airflow service-level agreements.
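As a rough illustration of the rule described above, the sketch below checks whether a task finished within its SLA window, measured from the dag run start time. It is plain Python rather than Airflow code: the missed_sla helper and the timestamps are hypothetical, and Airflow's own evaluation logic may differ in detail. The comment shows how the same timedelta would be passed to an operator, assuming an Airflow 2.x release in which the sla parameter is available.

```python
from datetime import datetime, timedelta

# In an Airflow 2.x dag file, this timedelta would be passed to an operator, e.g.:
#   BashOperator(task_id="load", bash_command="echo load", sla=timedelta(minutes=30))
LOAD_SLA = timedelta(minutes=30)


def missed_sla(dag_run_start: datetime, task_end: datetime, sla: timedelta) -> bool:
    """Return True if the task finished later than dag_run_start + sla."""
    return task_end > dag_run_start + sla


# Hypothetical dag run that starts at 06:00.
run_start = datetime(2025, 3, 3, 6, 0)

on_time_end = datetime(2025, 3, 3, 6, 20)  # finished 20 minutes after run start
late_end = datetime(2025, 3, 3, 6, 45)     # finished 45 minutes after run start

print(missed_sla(run_start, on_time_end, LOAD_SLA))
print(missed_sla(run_start, late_end, LOAD_SLA))
```

Note that the window is anchored to the start of the dag run, not to the start of the task itself, so time spent waiting on upstream tasks counts against the SLA.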
The functionality of Airflow SLAs has known limitations, and changes to the feature are expected. Use with caution.
SLAs on Astro Observe
Using Astro Observe, you can create SLAs to:
- Set up proactive alerting to catch failing tasks before they cause an SLA miss.
- Evaluate the on-time delivery of a Data Product with reactive alerting.
- Evaluate the freshness of a Data Product with reactive alerting.
- Track historical performance metrics including SLA hits and misses at a glance.
For guidance on setting up SLAs on your Data Products on Astro, see: Create and use data products with Astro Observe.
Use case: timeliness and freshness SLAs for a dashboard
Let's say your team is responsible for delivering product usage data to internal teams each day at 1 PM ET. The executive team at your company also needs the data to be consistently updated for use in a dashboard. In this scenario, you might want an SLA for evaluating on-time delivery each day and freshness on an hourly basis.
Your team's ETL pipeline looks like this:
To set up the SLAs:
- On Astro, create a Data Product to track assets in the pipeline. In this case, you can select the pipeline's load task. Doing so will automatically capture upstream tasks in the same pipeline:
  For detailed instructions on creating Data Products, see: Create and use data products with Astro Observe.
- Create new timeliness and freshness SLAs on the SLA Evaluations tab of the Data Product's overview page. For the timeliness SLA, set a verification time of 1:00 PM ET on weekdays and a lookback period of 1 hour. For the freshness SLA, set a policy of 1 day.
  For detailed instructions on creating SLAs, see: Create and use data products with Astro Observe.
- Create an alert to get an email notification if either SLA misses.
  For detailed instructions on creating alerts on your SLAs, see: Create and use data products with Astro Observe.
- In the event of an SLA miss, Astro sends an email like the following example, which contains a link to the affected asset(s) for easy debugging:
- Track your SLA evaluations on the SLA Evaluations tab of the Data Product dashboard. If the first evaluation misses one of your SLAs, the SLA Evaluations tab will log it:
- Clicking on an SLA evaluation will take you to a tabbed view of details about the asset at the point in time of the miss:
- The graph tab will show you the task that failed in the pipeline, with the detailed view providing a link to Airflow for easy review of the code:
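The SLA settings from this use case can be sketched in plain Python to show what each evaluation would look at. This is not Astro Observe code. The helper names are hypothetical, and the sketch assumes that the lookback period is the window immediately before the verification time in which an update must have landed, and that the freshness policy is a maximum age for the latest update. To stay dependency-free, all datetimes are naive wall-clock times in ET; a real implementation would use timezone-aware datetimes.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional, Tuple

VERIFICATION_TIME = time(13, 0)       # 1:00 PM ET (naive wall-clock time)
LOOKBACK = timedelta(hours=1)         # assumed: update must land in the hour before
FRESHNESS_POLICY = timedelta(days=1)  # assumed: maximum age of the latest update


def timeliness_window(day: date) -> Optional[Tuple[datetime, datetime]]:
    """Return (window_start, verification_time), or None on weekends."""
    if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return None
    verify_at = datetime.combine(day, VERIFICATION_TIME)
    return verify_at - LOOKBACK, verify_at


def timeliness_hit(last_update: datetime, day: date) -> Optional[bool]:
    """True if the latest update landed inside the day's window; None if no evaluation."""
    window = timeliness_window(day)
    if window is None:
        return None
    start, end = window
    return start <= last_update <= end


def freshness_hit(last_update: datetime, now: datetime) -> bool:
    """True if the latest update is no older than the freshness policy."""
    return now - last_update <= FRESHNESS_POLICY


monday = date(2025, 3, 3)
verify_at = datetime(2025, 3, 3, 13, 0)

good_update = datetime(2025, 3, 3, 12, 40)   # landed 20 minutes before verification
stale_update = datetime(2025, 3, 2, 12, 50)  # nothing has landed since the previous day

print(timeliness_hit(good_update, monday), freshness_hit(good_update, verify_at))
print(timeliness_hit(stale_update, monday), freshness_hit(stale_update, verify_at))
```

Under these assumptions, a load task that fails on Monday leaves the previous day's update as the latest one, so both the timeliness and the freshness evaluation would miss at 1:00 PM.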
To learn more about observability on Astro, including how to leverage data products for performance and health monitoring benefits, see Enhance data observability with Astro.