Airflow in Action: ETL Insights from Bloomberg — Slashing Runtimes by 50%
In the financial data marketplace, Bloomberg stands as a leading provider of comprehensive, high-quality data to global financial institutions. Within this ecosystem, Bloomberg’s mortgage team supports investor and analyst clients by delivering critical datasets on approximately 50 million loans and 1.4 million securities, comprising nearly 5 billion data points.
The unique complexity of this data, provided twice monthly by government-sponsored mortgage entities like Fannie Mae, Freddie Mac, and Ginnie Mae, posed significant operational challenges that Bloomberg aimed to resolve by transitioning to an automated data orchestration framework.
In their session Streamlining a Mortgage ETL Pipeline at this year’s Airflow Summit, Bloomberg software engineers Zhang Zhang and Jenny Gao discuss the challenges they faced with the existing workflow, how they selected an orchestration solution, and the results they’ve achieved with Apache Airflow®.
Event-Driven Data Pipelines Serving Critical Decision Making
Bloomberg’s clients rely on up-to-date and precise data to assess the health of mortgage-backed securities. With data updates occurring on tightly controlled schedules—the fourth and sixth business days of each month (BD4 and BD6)—the timeliness and accuracy of Bloomberg’s mortgage workflows are crucial.
Clients expect to be able to view both raw data and aggregates, enabling them to easily compare performance across different mortgage classes. Any delay in the pipeline could impact financial institutions across the U.S., Europe, and Asia, all of which expect Bloomberg’s data to be available as soon as it’s published.
The need for a dependable, automated ETL data pipeline became essential, as Bloomberg’s existing setup involved largely manual processes that hindered efficiency and transparency.
Slow, Risky, and Complex
Previously, Bloomberg’s mortgage data pipeline operated manually, requiring spreadsheets and human intervention at multiple stages. This manual setup came with several notable challenges:
- Labor-Intensive: Each bi-monthly data refresh drove a set of highly interdependent workflows, from extracting raw data to transforming it by computing aggregates and updating securities. A failure or error at any stage could break the entire pipeline.
- Key-Person Risk: The specialized knowledge needed to execute the workflow resided with a few key individuals, raising the risk of operational delays if those team members were unavailable.
- Complexity and Runtime Inefficiencies: As the mortgage data set grew and new client requirements emerged, the workflow became increasingly complex, necessitating more pre-processing steps and extending the overall runtime. This risked affecting Bloomberg’s ability to meet client expectations for fast data turnaround.
- Limited Observability: Without a centralized orchestration tool, the team lacked visibility into the pipeline’s real-time status. Stakeholders needed to trust that the team was monitoring the process accurately, as the system didn’t allow for easy tracking or troubleshooting.
Why Bloomberg Chose Apache Airflow
To address these issues, Bloomberg evaluated several orchestration tools, including Dagster, Prefect, Faust, and Argo. Ultimately, they selected Apache Airflow for several distinct advantages:
- Python First: Airflow’s Python-based framework integrated seamlessly with Bloomberg’s existing tech ecosystem, making it easier for their engineers to adopt and customize the platform to meet the unique requirements of the mortgage ETL pipeline.
- Fine-Grained Task Control: Airflow’s ability to pause and replay tasks allowed Bloomberg to respond quickly to errors, especially useful in a pipeline with complex dependencies and high data volumes.
- Robust UI and Logging: With Airflow’s rich web interface and logging capabilities, Bloomberg’s engineers gained visibility into the ETL pipeline. They could now monitor active tasks, review logs, and investigate issues without needing to rely on separate tools, improving transparency and troubleshooting speed.
- Community and In-House Expertise: Bloomberg found value in Airflow’s strong open-source community and internal adoption. Several teams at Bloomberg already relied on Airflow, providing a foundation of shared knowledge and support.
Figure 1: By encoding workflows into Airflow DAGs and optimizing task dependencies, Bloomberg has cut run times by 51%.
Airflow Outcomes: Increased Efficiency, Reduced Runtime, and Enhanced Stability
Since implementing Apache Airflow two years ago, Bloomberg’s mortgage pipeline has seen substantial improvements. Most notable is a 51% reduction in run time, enabling the team to meet BD4 and BD6 deadlines more reliably and enhancing service for its global clients. In addition to slashing run times, Bloomberg has eliminated key-person risk and improved workflow monitoring to detect issues sooner.
Bloomberg’s ETL data pipeline is event-driven, kicking off ingestion and pre-processing as soon as the mortgage agencies publish their updated files. The ETL pipeline comprises over 100 different tasks, with many using Airflow’s Bash and Python operators. The company’s engineers have also implemented custom operators within Airflow, such as a message bus operator to handle event-based triggers and an alert operator to notify stakeholders of key events.
The team continues to refine its use of Airflow, conducting dry runs to ensure pipeline integrity before each live data update. This setup serves not only as a robust dev/test/prod environment but also as a “living documentation” of the ETL pipeline, simplifying onboarding and knowledge transfer:
- For development, the team creates isolated sandboxes running Airflow’s local executor with a SQLite database.
- For production, the company uses the Celery Executor, with PostgreSQL as the metadata database and RabbitMQ for task distribution, and with workers spread across four data centers for resilience. Data center outages are rare, but when Bloomberg has encountered them, the Airflow pipeline continued processing without interruption.
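This kind of dev/prod split is typically expressed through Airflow’s standard configuration overrides. The fragment below is a generic sketch of the two setups described above, using environment variables; the hostnames and credentials are placeholders, not Bloomberg’s actual configuration.

```shell
# Development sandbox (per engineer) — SQLite only supports sequential task runs
export AIRFLOW__CORE__EXECUTOR=SequentialExecutor
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=sqlite:////tmp/airflow-dev/airflow.db

# Production — Celery Executor with a PostgreSQL metadata DB and RabbitMQ broker
# (apply one set of variables per environment, not both at once)
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:changeme@pg-host/airflow
export AIRFLOW__CELERY__BROKER_URL=amqp://airflow:changeme@rabbitmq-host:5672/
export AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:changeme@pg-host/airflow
```

Keeping the two environments on the same DAG code with only configuration differing is what lets dry runs in the sandbox faithfully rehearse each live BD4/BD6 update.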
Next Steps
In adopting Airflow, Bloomberg has transformed a once-manual process into a streamlined, automated workflow, offering greater stability, transparency, and efficiency, delivering essential financial data to its clients on time and at scale. You can see the details by watching Bloomberg’s session Streamlining a Mortgage ETL Pipeline with Apache Airflow.
The best way to simplify ETL/ELT pipelines is to use Astro, the industry’s leading managed Airflow service. You can get started for free here.