Airflow in Action: Deploying AI Clusters to 100 Data Centers in 3 Months. Infrastructure Insights from Cloudflare
Cloudflare, a global cloud services provider, is on a mission to build a better Internet. By delivering a unified platform of cloud-native products and developer tools, the company empowers businesses of all sizes to enhance the security and performance of their critical applications while reducing the complexity of traditional network management.
Operating at a massive scale, Cloudflare serves ~30% of the Fortune 1000, blocked an average of 158 billion cyberthreats daily in Q2 2024, and processes an average of 60 million HTTP requests per second. With infrastructure spanning 330 cities across 120+ countries, Cloudflare ensures 95% of the world’s Internet users can access its services within 50 milliseconds.
Operating at this scale demands extreme operational automation, which is why the Infrastructure Engineering team at Cloudflare turned to Apache Airflow®, using it to orchestrate the provisioning, diagnostics, and recovery of infrastructure in hundreds of data centers around the world.
At this year's Airflow Summit, Jet Mariscal, a Tech Lead on the Cloudflare infrastructure team, explained how they do it in his session Unlocking the Power of Airflow Beyond Data Engineering at Cloudflare. Attendees learned how Airflow powers Cloudflare's autonomous systems, such as Phoenix and Zero Touch Provisioning (ZTP), enabling efficient diagnostics, recovery, and deployment at scale.
Challenges in Infrastructure Recovery
Operating a vast fleet of hardware comes with its own challenges, especially when dealing with hardware failures. Common issues like disk failures, motherboard problems, and CPU voltage errors require systematic and thorough diagnostics to bring servers back into production.
Previously, Cloudflare relied on manual processes for diagnosing and recovering servers, leading to inefficiencies and inconsistencies. Human error and incomplete documentation often resulted in a growing backlog of broken servers awaiting repair.
Introducing Phoenix: Autonomous Diagnostics and Recovery
To address these challenges, Cloudflare developed Phoenix, an autonomous system that discovers, diagnoses, and recovers servers across their global data centers—all powered by Airflow. Phoenix automates the entire workflow, from powering on broken servers to running diagnostics with a custom Linux image containing Cloudflare’s diagnostic tools. Servers passing the tests are automatically reintegrated into the network, while those that fail are flagged for human intervention. Once repairs are completed, Phoenix seamlessly resumes testing to ensure the servers are ready for production.
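To make this workflow more concrete, here is a minimal, hypothetical sketch of how a diagnostics-and-recovery DAG of this shape could be expressed in Airflow. The DAG ID, server IDs, and helper logic are illustrative assumptions, not Cloudflare's actual Phoenix code.

```python
# Minimal, hypothetical sketch of a Phoenix-style diagnostics DAG.
# All helper logic below stands in for Cloudflare's internal tooling.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def phoenix_recovery_sketch():

    @task
    def power_on(server_id: str) -> str:
        # Placeholder for an out-of-band management call that powers on the
        # broken server and boots it into the custom diagnostics Linux image.
        print(f"Powering on {server_id} into the diagnostics image")
        return server_id

    @task.branch
    def evaluate_diagnostics(server_id: str) -> str:
        # Placeholder for reading the diagnostic results (disk, motherboard,
        # CPU voltage) and deciding which path the workflow should follow.
        diagnostics_passed = True
        return "reintegrate_server" if diagnostics_passed else "flag_for_repair"

    @task
    def reintegrate_server():
        # Servers that pass every test are returned to production automatically.
        print("Re-enabling server in the production network")

    @task
    def flag_for_repair():
        # Servers that fail are handed off for human intervention.
        print("Opening a repair ticket for data center technicians")

    server = power_on("server-001")
    evaluate_diagnostics(server) >> [reintegrate_server(), flag_for_repair()]


phoenix_recovery_sketch()
```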
Airflow's versatility makes it an ideal orchestrator for Phoenix. Features like custom hooks and operators allow integration with internal tools, while sensors and macros enable event-driven workflows. Additionally, built-in features like XCom and the TriggerDagRunOperator streamline complex, interdependent workflows, ensuring efficient and modular execution.
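As a hedged illustration of that last point, the snippet below shows how one DAG could discover a broken server, publish its ID via XCom, and hand off to a separate recovery DAG with the TriggerDagRunOperator. The DAG IDs and payload fields are invented for the example and are not Cloudflare's actual workflow names.

```python
# Hypothetical sketch: chaining workflows with XCom and TriggerDagRunOperator.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


@dag(schedule="@hourly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def discovery_sketch():

    @task
    def find_broken_server() -> str:
        # Placeholder for an inventory or alerting query; the returned value
        # is stored as an XCom and reused below.
        return "server-042"

    trigger_recovery = TriggerDagRunOperator(
        task_id="trigger_recovery",
        trigger_dag_id="phoenix_recovery_sketch",  # the DAG sketched above
        # conf is a templated field, so the discovered server ID can be pulled
        # from XCom and passed into the triggered run's configuration.
        conf={"server_id": "{{ ti.xcom_pull(task_ids='find_broken_server') }}"},
    )

    find_broken_server() >> trigger_recovery


discovery_sketch()
```

Splitting discovery and recovery into separate DAGs in this way keeps each workflow small and independently triggerable, which is the kind of modular execution described above.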
Zero Touch Provisioning (ZTP): Accelerating GPU Deployment
Cloudflare has also extended Airflow's capabilities with its Zero Touch Provisioning (ZTP) system, enabling rapid deployment of inference-optimized GPUs across its global network. ZTP autonomously detects new hardware and provisions it without human intervention. This capability was critical to Cloudflare's rollout of GPU inference clusters for serverless AI workloads: within three months, the company deployed GPUs to more than 100 data centers, and the rollout is ongoing, with plans to cover the entire network by year-end.
ZTP's seamless orchestration through Airflow has significantly reduced deployment times, allowing Cloudflare to scale its infrastructure rapidly to meet growing demand for AI and machine learning capabilities.
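For readers curious what an event-driven provisioning flow of this kind might look like in Airflow terms, here is a rough, hypothetical sketch: a sensor watches for newly racked GPU nodes and, once any are detected, a downstream task provisions them. The inventory query and provisioning steps are placeholders, not Cloudflare's internal APIs.

```python
# Hypothetical sketch of a ZTP-style flow: detect new hardware, then provision it.
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.sensors.base import PokeReturnValue


@dag(schedule="@hourly", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def ztp_sketch():

    @task.sensor(poke_interval=300, timeout=60 * 60, mode="reschedule")
    def detect_new_gpu_nodes() -> PokeReturnValue:
        # Placeholder for an inventory query that returns newly racked,
        # unprovisioned GPU nodes.
        new_nodes = ["gpu-node-17"]
        return PokeReturnValue(is_done=bool(new_nodes), xcom_value=new_nodes)

    @task
    def provision(nodes: list[str]):
        # Placeholder for the real provisioning steps: install the OS image,
        # apply configuration, and join each node to the inference cluster.
        for node in nodes:
            print(f"Provisioning {node} with zero human intervention")

    provision(detect_new_gpu_nodes())


ztp_sketch()
```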
Next Steps
Cloudflare's innovative use of Airflow demonstrates the platform's versatility and power beyond traditional data engineering, extending it to automated infrastructure management workflows. From autonomous server recovery to accelerated GPU deployments, Airflow has proven indispensable in optimizing Cloudflare's global infrastructure. To learn more, watch the full session from the Airflow Summit: Unlocking the Power of Airflow Beyond Data Engineering at Cloudflare.
Infrastructure management is an increasingly common use case for Airflow. In an earlier Airflow in Action blog post, we recapped LinkedIn's Summit session, in which its infrastructure team shared how it uses Airflow to orchestrate one million deployments every month.
Airflow supports a wide variety of use cases. The best way to get started is on Astro, the industry's leading managed Airflow service. Head over to the product page to learn more about all of the enhancements Astro offers, and sign up for a free trial of the service.