Airflow in Action: Data Engineering Insights from Burns & McDonnell
From Data Demand to Data Delivery in <24 Hours
Founded in 1898, Burns & McDonnell has grown to become one of the world’s largest design firms, with 14,000 professionals working from 75+ offices around the globe. All of them rely on data to support the delivery of complex construction and engineering projects. However, the company’s data platform hadn’t kept pace with demands from the business.
In the talk Orchestration from Zero to 100 at this year’s Airflow Summit, Burns & McDonnell data engineer Bonnie Why discusses the specific challenges the company faced and the steps the team took, using Apache Airflow® as an orchestrator, to create a scalable and trustworthy data platform. The result is a single source of truth where all of the company’s data is searchable, reliable, and accessible to the employee-owners and the projects that need it.
In this post, we’ll recap Bonnie’s session before providing resources to learn more.
Wrangling the Wild West of Data
When Bonnie joined Burns & McDonnell in 2023, her first project was to get on top of the company’s disconnected and sprawling data estate. Rather than evolving systematically, the data platform had been cobbled together to service ad-hoc requests from the business. As a result, Bonnie faced a platform that was:
- Hard to maintain. Multiple workflows and processes ingested and worked with data from a multitude of source systems, including apps, APIs, spreadsheets, and file systems, with teams working in disconnected silos.
- Hard to trust. A lack of metadata and lineage meant that data was poorly defined and wasn’t easily discoverable. This resulted in huge duplication of both data and effort as every team reinvented the wheel to serve new requests from the business.
- Hard to change. A lack of visibility into the system made it impossible to understand dependencies between systems, who was using the data, and for what. With little testing, most bugs were found by users as they consumed the data.
These three challenges defined the capabilities Burns & McDonnell needed from its new data platform. It had to be:
- Scalable to keep pace with the business. A key strategy for achieving that was centralizing the platform so that data engineering could provide uniform access to data.
- Reliable to ensure only the right data was used. Central to that goal was creating a company-wide understanding of data through lineage and a shared language across teams.
- Evolvable to support new use cases, business requirements, and emerging technologies.
Maturing Airflow Usage
Bonnie started her data platform journey by using Apache Airflow with an application that needed to ingest 1,500 tables from one of the company’s Oracle databases. This experience helped Bonnie and the data engineering team lay the foundations for building a centralized data platform. In her talk, Bonnie discusses how their use of Airflow matured. This included:
- Simplifying code by moving from custom operators to Airflow’s TaskFlow API (see the sketch after this list).
- Meeting new use cases by designing around different patterns and classes.
- Increasing engineering velocity by implementing new development methods along with the use of ephemeral environments for testing.
- Improving performance with file chunking and dynamic task parallelization using Airflow’s expand() function (also shown in the sketch below).
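To make the last two points concrete, here is a minimal, illustrative sketch, not the team’s actual code, of a TaskFlow DAG that replaces custom operators with decorated Python tasks and uses expand() to map a load task across file chunks at runtime. The DAG name, chunk names, and loading logic are placeholder assumptions.

```python
# Hypothetical sketch: TaskFlow API plus dynamic task mapping with expand().
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def chunked_ingestion():
    @task
    def list_chunks() -> list[str]:
        # In practice this might list files in a landing zone; hard-coded
        # names keep the sketch self-contained.
        return ["chunk_001.csv", "chunk_002.csv", "chunk_003.csv"]

    @task
    def load_chunk(chunk: str) -> int:
        # Placeholder for real ingestion logic (e.g. copying into storage).
        print(f"loading {chunk}")
        return 1

    @task
    def report(counts: list[int]) -> None:
        # Runs once, after every mapped load_chunk instance has finished.
        print(f"loaded {sum(counts)} chunks")

    # expand() creates one mapped task instance per chunk at runtime.
    report(load_chunk.expand(chunk=list_chunks()))


chunked_ingestion()
```

Because the mapped task instances are created when the DAG runs, the same pipeline handles three chunks or three thousand without code changes, which is what makes this pattern attractive for large, variable file loads.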
From Zero to 100 with Airflow
Today, Airflow serves a classic ETL use case at Burns & McDonnell: ingesting source data into the data platform’s landing zone running on Azure Blob Storage, orchestrating Databricks and dbt to process and transform the data, and storing it in Delta Live Tables for serving to data consumers. Alation is used for data cataloging.
Figure 1: Airflow orchestration powers the heavy-lifting behind the Burns & McDonnell data platform. Image source
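As a rough illustration of the flow in Figure 1, the hedged sketch below wires that pattern together with standard provider operators: land a file in Blob Storage, trigger a Databricks job, then run dbt. The connection IDs, file path, job ID, and dbt project location are hypothetical placeholders, not details from Burns & McDonnell’s setup.

```python
# Hypothetical sketch of an ingest -> Databricks -> dbt orchestration DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.providers.microsoft.azure.transfers.local_to_wasb import (
    LocalFilesystemToWasbOperator,
)

with DAG(
    dag_id="platform_etl",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # 1. Ingest source data into the Blob Storage landing zone.
    land_file = LocalFilesystemToWasbOperator(
        task_id="land_file",
        file_path="/data/extracts/orders.parquet",  # hypothetical path
        container_name="landing-zone",
        blob_name="orders/orders.parquet",
        wasb_conn_id="azure_blob_default",  # hypothetical connection ID
    )

    # 2. Transform the landed data with a pre-defined Databricks job.
    transform = DatabricksRunNowOperator(
        task_id="transform_in_databricks",
        databricks_conn_id="databricks_default",
        job_id=12345,  # hypothetical job ID
    )

    # 3. Build downstream models with dbt.
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/project && dbt build",  # hypothetical project path
    )

    land_file >> transform >> dbt_build
```

Keeping each stage as its own task is a deliberate design choice in this kind of pipeline: it gives per-step retries, logs, and dependencies that are visible in the Airflow UI.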
As Bonnie says, “everyone’s data is important and everyone’s use case is a priority”. Now her data engineering team can turn around requests from the business in less than 24 hours, providing reliable, trustworthy, and accessible data to consumers across the company. Airflow has been central to achieving this transformation.
Bonnie wrapped up her session by describing the three benefits Airflow has provided to data engineering at Burns & McDonnell:
- The team can now scale to develop and ship data faster. Airflow provides a code-first, developer-centric experience founded on Python and software engineering best practices.
- Data pipelines and workflows are reliable. Airflow provides efficient and resilient scheduling.
- The data platform is adaptable and evolvable. A large open source community rapidly enhancing the core product (Airflow 3.0 is a prime example) and its provider packages means the team can quickly embrace new use cases and new technologies.
Learn More
You can get all of the details by watching the Summit replay session Airflow at Burns & McDonnell | Orchestration from zero to 100.
To get the best Airflow experience, build and run your workflows on the Astro managed service. Astro enables companies to place Airflow at the core of their data operations, providing ease of use, scalability, and enterprise-grade security to ensure the reliable delivery of mission-critical data pipelines.