Airflow in Action: DataOps Insights from 2,200 Pipelines at Instacart
Instacart is one of North America’s leading grocery technology companies, working with retailers to transform how people shop. The company partners with more than 1,500 national and local retail companies to facilitate online shopping, delivery, and pickup services from more than 85,000 stores collectively serving millions of customers. The Instacart Platform also provides retailers with advertising services and analytical insights into operations.
At this year's Airflow Summit, Anant Agarwal, a software engineer on the company’s Data Infrastructure team, shared how Instacart relies on Apache Airflow® to orchestrate a vast network of large and intricate pipelines securely, compliantly, and at scale.
Airflow’s Past, Present and Future at Instacart
Anant kicked off his talk by discussing how Instacart started using Airflow in 2018, initially to orchestrate the dbt tasks used to generate financial reports and analytics. Before the Data Infrastructure team took ownership, orchestration was spread across a combination of legacy Airflow clusters used by multiple teams, with no centralized management. This created a number of issues: teams were running on older Airflow versions, they regularly ran into scalability problems, and with limited visibility it was impossible to optimize their pipelines.
Fast forward to today, and Instacart has one central Airflow cluster managed by the Data Infrastructure team. The cluster runs on Amazon Elastic Container Service (ECS) with a modern version of Airflow (2.7.3 at the time of the Summit), which is upgraded every six months to take advantage of the latest features. Airflow supports multiple use cases, orchestrating over 2,200 unique pipelines and 16 million tasks per month with a 99.5% completion success rate. Stringent auditability and compliance controls are enforced across the Airflow cluster.
Looking to the future, the remaining legacy Airflow clusters will be retired and all workloads migrated to the central cluster. The team intends to maintain a healthy pace, moving one set of workloads off the legacy clusters every quarter.
Figure 1: The Airflow deployment at Instacart. Image source.
Instacart’s Custom Airflow Ecosystem: Self-Service, IaC, and Beyond
Key to scaling Apache Airflow across so many use cases at Instacart has been the custom tooling and integrations built by the company’s Data Infrastructure team. Anant outlined these in his talk, including:
- Self-service authoring. Teams at Instacart can create their own Airflow pipelines. While self-service helps them get new data workflows up and running faster, Airflow’s power and versatility also make it easy for less technical users to shoot themselves in the foot. To mitigate this risk, the Data Infrastructure team created an authoring tool that lets users define DAGs in YAML, with the UI exposing only a limited subset of Airflow’s functionality.
- Infrastructure as Code (IaC). Custom integrations with Terraform make it fast and easy to spin up and scale resources consistently across development and production environments, ensuring auditability and compliance controls are properly configured.
- Monitoring and alerting. The company uses Datadog to monitor both its Airflow cluster and its DAGs.
- Abstractions. The team has built a custom ECS operator and external ECS worker tasks for cluster resource management; a sketch of the underlying pattern follows this list.
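Instacart's operator is internal and was not shown in the talk, but you can get a feel for the kind of ECS task launching it abstracts from the stock EcsRunTaskOperator in the Amazon provider package. The sketch below is purely illustrative; the cluster name, task definition, and container command are placeholders, and a production setup would also need connection and network configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="ecs_task_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Launch a containerized job as an ECS task. A custom operator like
    # Instacart's would typically wrap this with organization-wide defaults
    # (cluster selection, IAM roles, logging, tagging).
    run_job = EcsRunTaskOperator(
        task_id="run_transform_job",
        cluster="data-infra-cluster",       # placeholder cluster name
        task_definition="transform-job:1",  # placeholder task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "transform", "command": ["python", "transform.py"]}
            ]
        },
        # Real Fargate usage also requires network_configuration and,
        # usually, an explicit AWS connection; omitted here for brevity.
    )
```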
Anant concluded his talk with a set of recommendations based on his experiences with Airflow. To hear all of the lessons learned from Anant and the team, watch the Summit session, Scaling Airflow for Data Productivity at Instacart.
Next steps
While the Instacart team developed its own operators, you don’t have to build custom tooling and integrations to get the best out of Airflow. For example, DAG Factory is an open source tool managed by Astronomer that lets you dynamically generate Apache Airflow DAGs from YAML, abstracting away DAG configuration in Python code. DAG Factory enables greater cross-team collaboration and self-service on Airflow; a minimal sketch is shown below.
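As an illustration of the pattern, here is a minimal sketch: a YAML file describing a two-task DAG, and the small Python file DAG Factory uses to generate it. The DAG name, task names, and file paths are hypothetical; see the DAG Factory documentation for the full configuration schema.

```yaml
# dags/config/example_dag.yml (hypothetical path)
example_yaml_dag:
  default_args:
    owner: "data-team"
    start_date: 2024-01-01
    retries: 1
  schedule_interval: "0 6 * * *"
  tasks:
    extract:
      operator: airflow.operators.bash.BashOperator
      bash_command: "echo extract"
    transform:
      operator: airflow.operators.bash.BashOperator
      bash_command: "echo transform"
      dependencies: [extract]
```

```python
# dags/generate_dags.py (hypothetical path)
from pathlib import Path

import dagfactory

# Parse the YAML config and register the resulting DAG(s) in this module's
# namespace so the Airflow scheduler can discover them.
config_file = Path(__file__).parent / "config" / "example_dag.yml"
factory = dagfactory.DagFactory(str(config_file))
factory.clean_dags(globals())
factory.generate_dags(globals())
```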
The easiest and fastest way to scale your data pipelines and workflows is to run them on the Astro managed Airflow service:
- With its Organization Dashboards, Astro provides deep monitoring and alerting over your entire Airflow estate.
- Astro Observe goes even further. It tracks data lineage, detects anomalies, monitors data quality, and provides insights into data operations, ensuring that your data remains accurate, timely, and accessible.
- For IaC, you can use the Astro Terraform Provider to programmatically manage your Astro resources (see the sketch after this list).
- For remote workers and efficient cluster resource management, check out the upcoming release of Airflow 3.0. Distributed execution with remote workers, which lets you run Airflow tasks anywhere, is one of the new version’s key features.
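To give a sense of the Astro Terraform Provider, the sketch below declares the provider and a single workspace. The resource and argument names reflect the provider documentation at the time of writing and may change, and the organization ID and workspace values are placeholders, so check the Terraform Registry for the current schema.

```hcl
terraform {
  required_providers {
    astro = {
      source = "astronomer/astro"
    }
  }
}

provider "astro" {
  # Authentication is typically supplied via an API token,
  # e.g. the ASTRO_API_TOKEN environment variable.
  organization_id = "<your-organization-id>"
}

# A minimal workspace definition; deployments, teams, and other
# resources can be declared the same way.
resource "astro_workspace" "data_platform" {
  name                  = "data-platform"
  description           = "Workspace for data platform pipelines"
  cicd_enforced_default = true
}
```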