Airflow in Action: ML and Data Engineering Insights from the Analytics-Obsessed World of MLB

Professional sports have always been a great proving ground for the latest technology innovations, and none more so than the data-intensive world of Major League Baseball (MLB). The Airflow Summit 2024 featured two sessions from MLB teams that showcased Airflow in action:

  1. The 2023 World Series champion Texas Rangers shared their winning playbook in the Summit’s second-day keynote. They showcased how data pipelines orchestrated by Airflow, running on the Astro managed service, enable fast decision-making and drive competitive advantage for players and coaches.
  2. The Philadelphia Phillies extensively use Machine Learning (ML) for player evaluation, acquisition, and development. The team presented the suite of tools they have developed to train, test, evaluate, and deploy ML models — orchestrated entirely by Apache Airflow®.

In this blog post, we’ll recap key highlights from each session and point you to further resources to learn more.

Before that, though: why was baseball so early to grasp the opportunities presented by advanced analytics and AI?

Ahead of the (Curve) Ball: How Data Revolutionized Baseball

Baseball was one of the first professional sports to adopt data analytics because of its inherently data-rich nature and the discrete, measurable events that occur during each game. Every pitch, hit, and play can be individually recorded and analyzed, creating a vast repository of statistics.

Baseball analytics was pioneered by the likes of Bill James, whose sabermetrics emerged in the 1970s, and the Oakland Athletics' "Moneyball" strategy proved its competitive edge in the 2000s. Today, the sport integrates advanced analytics and AI to drive decision-making and strategy. Teams leverage technologies like Statcast to analyze player movements and ball trajectories, enabling precise evaluations, optimized defenses, and tailored training. Predictive analytics guide scouting, roster management, and injury prevention, while real-time data shapes in-game strategies like pitch selection and batting orders.

Baseball's embrace of these advanced technologies continues to revolutionize the sport, setting new standards for data-driven decision-making in athletics. Let’s explore two examples.

World Series Winning Orchestration Strategies with the Texas Rangers

At the time of his keynote session, Oliver Dykstra worked for the reigning MLB World Series champions, so he felt it was reasonable to describe himself as the World Series data engineering champion. We totally agree.

In his session, Oliver described how the Texas Rangers' first data platform relied on cron for scheduling and therefore had no awareness or visibility of dependencies across the data supply chain. For example, data processing and transformation jobs would kick off before all of the data was ready, compromising the quality of analytics and wasting expensive compute resources. On top of this, the system was incredibly hard to troubleshoot when failures occurred, consuming valuable data engineering cycles.

At the same time, the data team had to contend with players and coaches who had an ever-growing appetite for data and insights and wanted them faster, yet had little trust in the outputs the data platform generated.

Fixing the Foundations with Airflow, Going Faster with Astro

The data engineering team chose Airflow as the backbone of the Texas Rangers' data ecosystem, leveraging its open-source foundation, active community, and frequent feature releases. As shown in the slide from Oliver’s session, Airflow enables seamless data ingestion from diverse sources, including player biomechanics, sensor data, and weather measurements. It orchestrates complex ETL pipelines across Databricks, with Monte Carlo ensuring data quality. The results are consumed by data science teams, ML models, scouts, players, and team management.

Figure 1: Unified data platform built on Astro, the managed Airflow service from Astronomer. Image source.

Oliver discussed how the data engineering team takes advantage of specific Airflow capabilities such as data-aware scheduling, which ensures pipelines only run when the data assets they depend on have actually been updated, combined with sophisticated execution flows controlled by Airflow’s conditional branching. Together, these capabilities ensure high-quality, trustworthy data is produced quickly and cost-effectively.
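
To make these capabilities concrete, here is a minimal sketch of the pattern: a dataset-driven DAG pair with a branching quality gate. This is illustrative only, not the Rangers' actual code; the feed URI, DAG names, and quality check are placeholder assumptions.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Placeholder URI standing in for one of the Rangers' raw feeds.
biomechanics = Dataset("s3://raw/biomechanics/daily.parquet")


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def ingest_biomechanics():
    @task(outlets=[biomechanics])  # marks the dataset updated on success
    def land_files():
        print("pull vendor files into object storage")

    land_files()


# Scheduled on the dataset, not a timer: this DAG runs only after
# fresh data has actually landed.
@dag(schedule=[biomechanics], start_date=datetime(2024, 1, 1), catchup=False)
def transform_biomechanics():
    @task.branch
    def quality_gate():
        rows_landed = 1  # placeholder for a real completeness check
        return "transform" if rows_landed > 0 else "quarantine"

    @task
    def transform():
        print("run the expensive transformation job")

    @task
    def quarantine():
        print("alert the team and skip the compute spend")

    quality_gate() >> [transform(), quarantine()]


ingest_biomechanics()
transform_biomechanics()
```

Because the downstream DAG is scheduled on the dataset rather than a clock, the expensive transform never fires against incomplete data, and the branch diverts bad runs before any compute is wasted.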

By running Airflow in the Astro managed service from Astronomer.io, the data engineering team gets access to advanced features that save them both time and money. For example:

  • Dynamically rightsizing Airflow workers to ensure the correct levels of computational resources are available when needed.
  • Providing deep telemetry to simplify troubleshooting and remediation. (Note that since Oliver’s keynote, Astronomer has released Astro Observe, bringing richer visibility and actionable intelligence to data pipelines.)
  • CI/CD integration enabling teams to get their pipelines into production faster and more reliably. Pipeline stages are code reviewed and tested before being promoted, and can be quickly rolled back if issues occur.

Building on these advanced features, Oliver states in his Diginomica interview, How the Texas Rangers use data analytics to score a game-winning advantage, that Astro reduces pipeline processing times by 80%. He went on to say:

“Having a managed deployment of Airflow means they take care of everything behind the scenes. If you don't have to think about it, you know your partner is doing the right thing. And during my day-to-day work, I don't have to think about Astronomer and Airflow – and it's like no sweat off my back.”

Figure 2: Astro provides efficient monitoring with cross-deployment visibility and health from a central dashboard. Image source.

Oliver wrapped up his keynote by stating how Airflow running in Astro had contributed to the Texas Rangers World Series win, providing the team with a competitive advantage through advanced analytics, timely reports, and powerful data visualizations. You can learn more by watching Oliver’s keynote: Winning Strategies: Powering a World Series Victory with Airflow Orchestration.

Orchestrating ML Pipelines at the Philadelphia Phillies

At the Airflow Summit, Mike Hirsch and Sophie Keith, engineers on the Phillies ML team, shared their journey orchestrating machine learning (ML) pipelines. Their session detailed how the team overcame infrastructure challenges and evolved their workflow orchestration practices, culminating in a scalable platform powered by Apache Airflow.

The Challenge: Bridging the Gap Between Analysis and Engineering

The Phillies’ ML engineering (MLE) team serves as the critical link between analysts and software engineers, working to deliver reliable, fast insights for player evaluation, acquisition, and development. However, their previous workflows faced a number of challenges:

  • Outdated Airflow Setup: The team relied on an old Airflow version with poor Kubernetes pod management, creating debugging difficulties.
  • Siloed Data Practices: Data was inconsistently stored and accessed, leading to performance bottlenecks.
  • Inefficient Collaboration: Cross-team coordination relied on manual processes and Slack channels, creating delays and confusion.
  • Tooling Misuse: Existing tools like BigQuery, dbt, and MLflow were underutilized or improperly configured.

These pain points not only slowed down model development but also limited the team's ability to support the growing demands of baseball research.

The Solution: A Unified Framework for ML Orchestration

To address these challenges, the Phillies undertook a comprehensive overhaul of their ML infrastructure, centering their solution around Airflow as the orchestrator for the entire ML lifecycle. This approach allowed them to integrate a robust suite of tools that streamlined their workflows.

MLflow became the foundation for standardized storage of model artifacts, parameters, and evaluation metrics, ensuring consistency and traceability. They used dbt to transform model outputs and intermediary datasets into BigQuery, enhancing accessibility and efficiency for data analysis. Additionally, Kubernetes and MLServer facilitated streamlined model training and deployment, providing seamless access to inference APIs.
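
To give a flavor of the MLflow piece, a training step can log parameters, metrics, and the model artifact under a single run. The sketch below is generic rather than the Phillies' implementation; the experiment name, features, and metric are placeholders.

```python
import mlflow
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Experiment name is a placeholder; any tracking server works the same way.
mlflow.set_experiment("swing-outcome-model")

X = np.random.rand(200, 4)  # stand-in for real feature tables
y = np.random.rand(200)

with mlflow.start_run():
    params = {"max_depth": 3, "n_estimators": 50}
    model = GradientBoostingRegressor(**params).fit(X, y)

    mlflow.log_params(params)
    rmse = float(np.sqrt(((model.predict(X) - y) ** 2).mean()))
    mlflow.log_metric("train_rmse", rmse)
    # The artifact, parameters, and metrics all live under one run ID,
    # which is what makes later comparison and rollback traceable.
    mlflow.sklearn.log_model(model, artifact_path="model")
```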

A pivotal aspect of their transformation was the development of an internal ML development platform. This platform introduced an SDK to standardize model workflows, isolated environments to prevent dependency conflicts, and a CLI for efficient training and deployment. By abstracting away infrastructure complexities, this platform enabled analysts to focus on research and analysis, significantly improving productivity and fostering innovation within the team.
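
The SDK itself is internal to the Phillies, but the shape of such an abstraction is worth sketching: a base class that fixes the workflow contract so every model builds features, trains, and evaluates the same way. The names below are purely hypothetical.

```python
from abc import ABC, abstractmethod


class ModelWorkflow(ABC):
    """Illustrative contract every model implementation would follow."""

    @abstractmethod
    def build_features(self): ...

    @abstractmethod
    def train(self, features): ...

    @abstractmethod
    def evaluate(self, model, features) -> dict: ...

    def run(self):
        # The platform, not the analyst, owns the lifecycle: the same
        # sequence executes identically from a notebook, the CLI, or Airflow.
        features = self.build_features()
        model = self.train(features)
        metrics = self.evaluate(model, features)
        # ...the real platform would log artifacts and metrics to MLflow here...
        return model, metrics
```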

The Results: Faster, More Reliable ML Pipelines

The Phillies’ new approach has yielded transformative results:

  • Improved Efficiency: Analysts no longer needed to write code for predictions or manage artifact generation, saving weeks of development time.
  • Enhanced Reliability: Standardized pipelines reduced debugging complexity and ensured seamless model retraining and deployment.
  • Clear Ownership: By isolating workflows, analysts and engineers could focus on their respective domains without stepping on each other's toes.
  • Scalable Orchestration: Airflow’s DAG templates enabled multi-model pipelines to adapt dynamically, retraining models as needed with minimal intervention.

The Phillies’ journey offers valuable lessons for data teams seeking to streamline ML workflows. Centralizing orchestration with Airflow as the backbone of their ML lifecycle proved to be a game-changer, simplifying coordination across tools and teams while enhancing overall efficiency. Establishing a consistent framework for managing model artifacts, dependencies, and outputs not only accelerated development but also made debugging more straightforward and reliable. Additionally, creating isolated environments eliminated dependency conflicts, allowing analysts to iterate on their models more quickly and focus on delivering impactful insights.

A prime example of the ML engineering team’s new capabilities is their “Super Model DAG,” shown below. With dependencies codified, when one model is retrained, downstream models automatically adapt without manual coordination.

Figure 3: The “Super Model DAG” orchestrates a complex multi-model pipeline where dependencies between models are codified. Image source.
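
The Phillies’ actual implementation isn’t public, but codified model-to-model dependencies can be sketched with a small DAG template driven by a registry: each model publishes a dataset, and dependent models schedule on the datasets they consume. All names below are hypothetical.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical registry: each model lists the prediction datasets it
# consumes (its upstream models) and the one it publishes.
MODELS = {
    "stuff_model": {"consumes": [], "produces": "bq://preds/stuff_model"},
    "pitch_value": {"consumes": ["bq://preds/stuff_model"],
                    "produces": "bq://preds/pitch_value"},
}


def make_retrain_dag(name: str, upstream: list[Dataset], output: Dataset):
    @dag(dag_id=f"retrain_{name}",
         # root models retrain weekly; dependents retrain whenever any
         # upstream model publishes fresh predictions
         schedule=upstream or "@weekly",
         start_date=datetime(2024, 1, 1),
         catchup=False)
    def retrain():
        @task(outlets=[output])
        def train():
            print(f"retrain {name} and publish predictions")

        train()

    return retrain()


for model_name, spec in MODELS.items():
    make_retrain_dag(model_name,
                     [Dataset(u) for u in spec["consumes"]],
                     Dataset(spec["produces"]))
```

Adding a model to the registry is then a one-line change, and Airflow wires up the retraining cascade automatically.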

You can get all of the details on the Phillies’ Airflow journey by watching their session A Game of Constant Learning & Adjustment: Orchestrating ML Pipelines at the Philadelphia Phillies.

Note that the upcoming Apache Airflow 3.0 release offers additional enhancements for MLOps workloads, including:

  • Support for advanced AI inference execution policies.
  • Extended MLOps support for backfills as models evolve.

Next Steps

Airflow is helping some of the highest-profile and most demanding teams turn data into competitive advantage. How do you get the best out of Airflow?

  • As the Texas Rangers data team will tell you, try Airflow out for free on Astro, the industry’s leading managed service for data orchestration.
  • The Phillies mentioned their small team encountered complexity in the Airflow upgrade cycle. With Astro, teams can take advantage of in-place Airflow upgrades and rollbacks, all fully managed for them. They get the benefits of new Airflow versions as soon as they are released, while freeing up valuable data engineering cycles to focus on more meaningful work for the business.

If, like the Phillies, you are using dbt and Airflow, take a look at dbt on Astro. Through the integration you gain complete visibility into your dbt tasks within Airflow, making it easier to detect and troubleshoot issues. In Astro, you can deploy dbt code independently from Airflow DAGs, simplifying your CI/CD processes and reducing deployment errors.
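
If you orchestrate dbt with open-source Airflow today, a common starting point is Cosmos, Astronomer’s open-source package that renders each dbt model as its own Airflow task. Here is a minimal sketch, with placeholder paths and profile names:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

# Project path, profile, and target names below are placeholders.
dbt_models = DbtDag(
    dag_id="dbt_models",
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
    ),
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```

Each dbt model then shows up as an individual task in the Airflow UI, which is what makes failures easy to pinpoint and retry.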
