The Texas Rangers Win Baseball Games with Analytics on Astro
The Texas Rangers accelerated their data delivery by 24 hours on Astro, without incurring any additional compute costs.
A longtime haven for meticulous record-keeping and statistical analysis, Major League Baseball (MLB) has quietly participated in the big data phenomenon since its dawn. As famously chronicled in Moneyball, front-office pressure to deliver better and better performance from players, coaches, and teams has reached a fever pitch over the last decade. Teams are relentlessly focused on identifying and honing any competitive advantage they can find, putting data teams at the heart of America’s favorite pastime.
While traditional stats like box scores, batting averages, and on-base percentage dominate baseball commentary, behind the scenes data teams are now charged with analyzing everything from players’ biomechanical performance, to micro-changes in weather conditions across the field, to massive amounts of position, speed, and rotational data from tracking technologies like Hawk-Eye. Modern organizations are even leveraging natural language processing (NLP) and AI models to analyze a variety of text data, including articles and scouting reports, to help find their next top amateur prospect.
The end goal has remained the same: win more games. However, the process of aggregating the right data to deliver simple, straightforward recommendations to team managers and owners (“bench this player, he’s got a developing hamstring injury”; “change up who’s playing second base this game, the opposing lineup will eat the normal guy’s lunch”) has grown dramatically more complex. One team at the forefront of this transformation is the Texas Rangers.
Journey to Airflow
The Rangers started out with a small data team in 2015, just a couple of employees sustaining a front-office organization of dozens, yet that team was tasked with a time-sensitive initiative: getting the Rangers in tip-top shape to take advantage of all the new methods of collecting, organizing, and analyzing data that was flowing in at an unprecedented rate. The team started off with a largely on-prem stack but was on the clock to move to the cloud and establish a more reliable, adaptable, and innovative platform.
Of particular interest to this team were open-source technologies, for a variety of reasons. Given the rapidly evolving nature of big data analysis in baseball, the team wanted to avoid vendor lock-in, and the primary drivers behind the transformation wanted to leverage the community benefits associated with open-source software (OSS). With these guiding principles in mind, they decided on Apache Airflow® as the main tool for managing their data pipelines after experiencing setbacks with cron and other time-based scheduling tools. Now, Airflow is the unifying layer across their data stack, able to adapt to the wide variety of use cases that the team supports.
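As a rough illustration of that shift (the DAG, schedule, and task names below are hypothetical, not the Rangers’ actual pipelines), a job that once lived in a crontab might look like this in Airflow, gaining retries, observability, and dependency management along the way:

```python
# A minimal, illustrative Airflow DAG replacing a cron entry like
# "0 6 * * *" (every day at 06:00). All names here are hypothetical.
from pendulum import datetime

from airflow.decorators import dag, task


@dag(
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # the same cron expression, now with retries and logging
    catchup=False,
    default_args={"retries": 2},
)
def daily_stats_refresh():
    @task
    def extract() -> dict:
        # e.g., pull the previous day's game data from an internal source
        return {"games": []}

    @task
    def load(payload: dict) -> None:
        # e.g., write aggregates to the warehouse
        print(f"Loaded {len(payload['games'])} games")

    load(extract())


daily_stats_refresh()
```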
Challenge and Move to Astro
With Airflow at the center of their data universe, the Rangers were benefiting from real-time game data streaming, comprehensive player health reporting, predictive analytics of everything from pitch spin to hit trajectory, and more. But despite the strides they made with Airflow, there were notable scaling difficulties with the open-source tool.
The data team approached Astronomer with a problem: the pipeline that processed live game analytics was maxing out its allotted CPU, and the resulting bottleneck was pushing completion time past 20 minutes. The analytics served to the team were therefore delayed, sometimes so much that instead of giving players feedback on their performance immediately post-game, the data team was forced to deliver it well after the game was over, or even the next day. Opposing teams with faster analytics delivery had a potential advantage, so the Rangers came to Astronomer for a solution that would get their analytics pipelines running as fast as possible.
How Astro helped
Through a working session between the Rangers’ data engineers and our in-house Airflow experts, the Astronomer team determined that worker queues would be the best way to stop maxing out CPU and drastically cut pipeline completion time. Worker queues are a proprietary Astro feature that lets you create dedicated pools of workers: only the tasks assigned to a given queue run on that queue’s workers, while all other tasks run on the default worker node type.
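In DAG code, routing a task to a worker queue takes nothing more than Airflow’s standard `queue` argument; the queue itself, including its worker node type and size, is configured on the Astro Deployment rather than in code. A minimal sketch, with a hypothetical queue name:

```python
# A minimal sketch: routing a CPU-heavy task to a dedicated Astro worker
# queue via Airflow's standard `queue` argument. The "heavy-compute" queue
# name is hypothetical; the queue itself is defined in the Astro
# Deployment's settings, not in DAG code.
from pendulum import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def live_game_analytics():
    @task  # no queue argument: runs on the Deployment's default worker queue
    def extract_game_feed() -> dict:
        return {"pitches": []}

    @task(queue="heavy-compute")  # runs only on the dedicated, larger workers
    def run_models(feed: dict) -> None:
        print(f"Scoring {len(feed['pitches'])} pitches on high-CPU workers")

    run_models(extract_game_feed())


live_game_analytics()
```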
This allows you to use task-optimized compute: CPU-intensive tasks can run on a more performant worker node type, while less intensive tasks run on a cheaper, less performant one. For the Rangers, this meant putting the CPU-intensive pipeline on its own dedicated worker queue, giving it access to the compute resources it needed without affecting the rest of the pipelines. With this change, the Astronomer team cut the pipeline’s completion time by over 80%, to around 3 minutes, and stopped maxing out CPU entirely. And because the implementation freed up extra compute, the data engineering team can now run four additional pipelines in parallel rather than being forced to run them in series.

Astronomer also helped the team implement data-aware scheduling, so their pipelines process data as soon as it becomes available, a much better fit for their use case than standard time-based scheduling.
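Data-aware scheduling is a built-in Airflow capability (Datasets, available since Airflow 2.4): a producer task declares a dataset as an outlet, and a consumer DAG scheduled on that dataset runs as soon as the update lands. A minimal sketch, with illustrative dataset URIs and task names rather than the team’s real ones:

```python
# A minimal sketch of Airflow's data-aware scheduling (Datasets).
# The dataset URI and all task/DAG names below are hypothetical.
from pendulum import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

game_feed = Dataset("s3://example-bucket/game-feed/latest.json")


@dag(start_date=datetime(2024, 1, 1), schedule="@hourly", catchup=False)
def ingest_game_feed():
    @task(outlets=[game_feed])  # marks the dataset as updated when the task succeeds
    def land_raw_feed() -> None:
        print("Raw game feed landed")

    land_raw_feed()


@dag(start_date=datetime(2024, 1, 1), schedule=[game_feed], catchup=False)
def postgame_analytics():
    # Runs as soon as the upstream DAG updates `game_feed`,
    # instead of waiting on a fixed time-based schedule.
    @task
    def compute_metrics() -> None:
        print("Computing post-game metrics")

    compute_metrics()


ingest_game_feed()
postgame_analytics()
```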
Successful outcome
As a result of these changes, the Rangers were able to accelerate their analytics pipeline enough to serve analytics immediately after a game rather than the next day. This gives players and coaching staff the informational edge they need to adapt to opponents quickly, while performances are fresh, and win more games.
Additionally, all of these changes came from optimizing the use of existing compute resources, so the massive increase in performance incurred no additional cost. It also meant more reliable pipeline delivery, since every task now has the compute it needs to complete consistently on time. With these optimizations, the data team has also freed up room in its budget to scale Airflow even further, bringing more teams onto the platform to enhance their productivity.