Airflow in Action: Data Engineering Insights from Uber and Its 200,000 Data Pipelines

Uber’s data platform is crucial to the company’s business model. It serves 156 million active customers and 7.4 million active drivers, scheduling 30 million trips per day in 10,000 cities across 70 countries. The platform powers a host of critical workloads spanning business and customer processes (e.g., tracking orders and trips, regulatory reporting, recommendations and promotions), BI and predictive analytics, risk and fraud prevention, and many more.

Back in 2016, Uber faced a significant challenge as its data platform matured: managing the multiple, disparate data workflow tools used by various teams. These included a mix of off-the-shelf tools such as Apache Oozie®, Apache Airflow®, and Jenkins, along with custom-built solutions written in Python and Clojure.

Each team member, whether a data scientist or a developer, had to navigate these different systems to move data. This lack of standardization meant massive duplication of effort across teams, inefficient resource utilization, and operational challenges: painful upgrades, difficult migrations, and security risks. The maintenance burden was immense. Each system required dedicated resources for operationalization, troubleshooting, bug fixing, and user education, making it harder to scale and adapt to Uber’s insatiable demand for data.

Converging on a single workflow system

At this year’s Airflow Summit, Uber’s Data Workflow team shared how they addressed these challenges by standardizing on a single workflow system. That system needed to handle the company’s scale, while being flexible enough to accommodate a wide range of users and use cases.

After evaluating multiple potential technologies, Uber selected Apache Airflow. The company’s decision was based on:

  1. Unified Workflow Platform: By standardizing on Airflow, Uber was able to replace multiple fragmented systems with one flexible solution, reducing complexity and operational overhead.
  2. Scalability: Airflow was selected for its ability to scale and manage workflows across Uber’s vast infrastructure, orchestrating any size of data movement and processing tasks.
  3. Ease of Use Across Teams: Airflow’s Python-based DSL (domain-specific language) provided an intuitive interface accessible to a wide array of users, including data scientists, developers, ML engineers, and operations staff (see the sketch after this list).
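For readers unfamiliar with Airflow’s DSL, pipelines are declared in ordinary Python. Below is a minimal sketch of what such a definition looks like in recent Airflow versions; the DAG ID, task names, and task logic are illustrative examples, not taken from Uber’s codebase.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_trips(**context):
    # Placeholder: pull the previous day's trip records from a source system.
    print(f"Extracting trips for {context['ds']}")


def load_warehouse(**context):
    # Placeholder: load the transformed records into the warehouse.
    print(f"Loading warehouse partition for {context['ds']}")


# A daily two-task pipeline: extract runs first, then load.
with DAG(
    dag_id="daily_trip_metrics",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_trips", python_callable=extract_trips)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_warehouse)

    extract >> load  # dependency: extract must finish before load starts
```

The same Python interface serves a data scientist wiring up a simple daily job and an engineer composing hundreds of interdependent tasks, which is what makes it viable as a common layer across teams.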

Step-by-step: Uber’s Airflow journey

The Workflow team’s session at the Summit covers Uber’s journey with Airflow, including the creation of Piper, their internal Airflow fork. Piper provided the customizations necessary to meet Uber’s unique scalability demands, along with integration into the company’s proprietary internal tooling.

The session steps through each stage of Uber’s journey, with projects like scheduler isolation, a UI-based authoring experience, and cross-region disaster recovery, all leading to today’s hybrid cloud architecture.

Figure 1: Hybrid cloud architecture supporting workflows running on-premises and in the cloud.

Today the Airflow-based Piper platform is used by 1,000 teams at Uber, running 200,000 distinct pipelines and orchestrating an average of 450,000 pipeline runs and 750,000 task runs every day.

The Uber team concluded by discussing how they plan to converge Piper more closely with the mainline Apache project with the upcoming release of Airflow 3.0, motivated by features such as event-driven scheduling. In fact, the engineers see so much value in Airflow 3.0 that they intend to contribute code back to Airflow.
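To illustrate what event-driven scheduling looks like in practice, here is a minimal sketch using Airflow 3.0’s asset-based scheduling; the asset URI, DAG IDs, and task bodies are hypothetical examples, not Uber’s code.

```python
from datetime import datetime

from airflow.sdk import DAG, Asset, task

# An asset representing upstream data; the URI is illustrative only.
trip_events = Asset("s3://example-bucket/trip-events")

# Producer DAG: marks the asset as updated whenever ingestion succeeds.
with DAG(dag_id="ingest_trip_events", start_date=datetime(2025, 1, 1), schedule="@hourly"):

    @task(outlets=[trip_events])
    def ingest():
        # Placeholder: write new trip events to the asset's location.
        print("ingested trip events")

    ingest()

# Consumer DAG: triggered by the asset update, not by a cron schedule.
with DAG(dag_id="recompute_trip_metrics", schedule=[trip_events]):

    @task
    def recompute_metrics():
        # Placeholder: recompute metrics whenever new events land.
        print("recomputing trip metrics")

    recompute_metrics()
```

Instead of polling on a timer, the downstream pipeline runs as soon as fresh data is available, which is the kind of capability the Uber team cited as motivation for converging with Airflow 3.0.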

Learn more

You can get all of the details by watching the Airflow Summit replay session Evolution of Airflow at Uber.

To get the best Airflow experience, build and run your workflows on the Astro managed service, which you can try for free here.

Build, run, & observe your data workflows.
All in one place.

Get $300 in free credits during your 14-day trial.

Get Started Free