Airflow in Action: Scaling Climate Intelligence with Zero-Downtime on HPC Clusters at Meteosim

At the Airflow Summit, Eloi Codina Torras — Product Owner and Software Developer at Meteosim — showcased his company’s integration of Apache Airflow® and the Slurm workload manager to streamline high-performance computing (HPC) workflows.
Attendees learned how Meteosim orchestrates complex simulations across multiple Slurm-managed compute clusters. Their solution uses deferrable operators and a custom-built integration to optimize resource utilization, maintain service uptime, and monitor jobs while also simplifying data pipeline creation for their product engineers.
Weather Forecasting at Scale: Meteosim’s Computational Challenge
Operating in more than 45 countries, Meteosim specializes in meteorological and environmental services. The company combines the strong scientific expertise of its physicists and engineers with advanced meteorological knowledge and cutting-edge numerical modeling tools. This unique approach enables its clients in industries such as energy, mining, and chemical production to make data-driven decisions, enhancing environmental risk management and operational excellence.
Meteosim’s meteorological and air quality forecasting relies on workflows that orchestrate vast amounts of data—global models, emissions, and observations—through four stages: acquisition, preprocessing, simulation, and postprocessing. The simulation step is particularly resource-intensive; forecasting weather for a region as large as California can require up to 24 hours of computation on their hybrid infrastructure, which combines on-premises bare-metal HPC clusters and cloud-based virtual machines.
At the heart of their infrastructure is Slurm (the Slurm Workload Manager, formerly the Simple Linux Utility for Resource Management), an open-source workload manager widely used in scientific computing. Slurm intelligently allocates resources, ensuring that tasks with high priority or specific memory, CPU, and node requirements are executed optimally. However, Slurm itself has no orchestration capabilities, making it necessary to pair it with a tool like Airflow.
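To illustrate the kind of resource requests Slurm handles, here is a minimal sketch of composing a job submission command. The `sbatch` flags (`--partition`, `--cpus-per-task`, `--mem`, `--time`) are standard Slurm options; the script name, partition, and resource values are purely illustrative and not taken from Meteosim's setup.

```python
def build_sbatch_command(script: str, partition: str, cpus: int,
                         mem_gb: int, walltime: str) -> list[str]:
    """Return the argv for submitting a job script to Slurm via sbatch."""
    return [
        "sbatch",
        f"--partition={partition}",     # which queue/cluster partition to use
        f"--cpus-per-task={cpus}",      # CPU cores requested per task
        f"--mem={mem_gb}G",             # memory per node
        f"--time={walltime}",           # walltime limit, HH:MM:SS
        script,
    ]

# Hypothetical example: a long-running regional simulation job.
cmd = build_sbatch_command("wrf_simulation.sh", "hpc", 64, 128, "24:00:00")
# On a cluster login node, subprocess.run(cmd, ...) would submit this job
# and Slurm would schedule it against the requested resources.
```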
From Crontab Chaos to Orchestrated Excellence
Before adopting Airflow, Meteosim relied on an unwieldy Crontab (cron table) file with thousands of entries, rudimentary monitoring tools, and no support for restarting failed pipelines. Fixing errors, particularly after long simulations, was time-consuming and error-prone.
These limitations drove Meteosim to adopt Airflow in 2021, which provided a scalable solution for orchestrating their pipelines.
Integrating Airflow and Slurm
The challenge lay in integrating Airflow with Slurm while preserving Slurm’s resource allocation strengths. Meteosim developed a custom integration using deferrable operators, enabling Airflow to trigger and monitor jobs on Slurm-managed clusters.
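The value of deferrable operators is that a worker slot is not blocked while a long Slurm job runs: the operator submits the job, then defers to an asynchronous trigger that polls cheaply until the job finishes. The framework-free sketch below illustrates that hand-off with plain asyncio; in real Airflow this is done by calling `self.defer(trigger=..., method_name=...)` from an operator and subclassing `BaseTrigger`, and the function names here are assumptions for illustration.

```python
import asyncio

# Slurm terminal job states (a representative subset).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

async def slurm_job_trigger(get_state, poll_interval: float = 0.01) -> str:
    """Asynchronously poll a job's state until it reaches a terminal state.

    While awaiting, the event loop is free to watch many other jobs --
    this is what lets a deferred task release its worker slot.
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        await asyncio.sleep(poll_interval)

def run_demo() -> str:
    # Simulated job lifecycle: PENDING -> RUNNING -> COMPLETED.
    states = iter(["PENDING", "RUNNING", "COMPLETED"])
    return asyncio.run(slurm_job_trigger(lambda: next(states)))
```

A single triggerer process can multiplex hundreds of such coroutines, which is why the pattern scales well for long-running HPC jobs.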

Figure 1: Integrating Slurm with Airflow.
The integration architecture features daemons on HPC primary nodes for job submission and monitoring, alongside a custom-built Slurm operator and triggers in Airflow. These components communicate through Redis, which serves as a messaging layer to manage job states and ensure reliability. For efficiency, the system was optimized to fetch job states in bulk, reducing overhead during peak loads with hundreds of concurrent tasks.
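The bulk-fetch optimization can be sketched as follows: rather than spawning one `squeue` process per job, a daemon on the head node queries many job IDs in a single call (e.g. `squeue --jobs=101,102,103 --noheader --format="%i %T"`, where `%i` is the job ID and `%T` the state) and publishes the parsed result, for instance to Redis. The parser below assumes that two-column output format; the job IDs are illustrative.

```python
def parse_squeue_states(output: str) -> dict[str, str]:
    """Map job id -> state from `squeue --noheader --format="%i %T"` output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:       # skip blank or malformed lines
            job_id, state = parts
            states[job_id] = state
    return states

# One squeue call covers all pending jobs at once:
sample = "101 RUNNING\n102 PENDING\n103 COMPLETED"
# parse_squeue_states(sample)
#   -> {"101": "RUNNING", "102": "PENDING", "103": "COMPLETED"}
```

Amortizing one process invocation across hundreds of concurrent tasks is what keeps the monitoring overhead low at peak load.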
Simplifying Pipeline Creation for Product Engineers
To make Airflow even more accessible, Meteosim built an internal web tool that automates pipeline creation.
Engineers can define tasks and parameters—such as resource requirements and environment variables—through an intuitive interface, without writing code. This standardization streamlines onboarding and ensures consistency across pipelines.
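A tool like this typically works by turning a declarative task specification into a generated DAG. The sketch below shows one plausible shape for such a spec, using the four pipeline stages described earlier, and a generator that orders tasks by their dependencies; all field names are hypothetical, not Meteosim's actual schema.

```python
# Hypothetical declarative spec, as an internal web tool might emit it.
pipeline_spec = {
    "name": "california_forecast",
    "tasks": [
        {"id": "acquire", "upstream": []},
        {"id": "preprocess", "upstream": ["acquire"]},
        {"id": "simulate", "upstream": ["preprocess"], "cpus": 64},
        {"id": "postprocess", "upstream": ["simulate"]},
    ],
}

def topological_order(spec: dict) -> list[str]:
    """Return task ids in an order that respects upstream dependencies."""
    ordered, done = [], set()
    pending = {t["id"]: set(t["upstream"]) for t in spec["tasks"]}
    while pending:
        # A task is ready once all of its upstream tasks are done.
        ready = [tid for tid, ups in pending.items() if ups <= done]
        if not ready:
            raise ValueError("cycle in task dependencies")
        for tid in ready:
            ordered.append(tid)
            done.add(tid)
            del pending[tid]
    return ordered
```

In production, each spec entry would be rendered into an Airflow task (e.g. a custom Slurm operator) and the dependency edges into `>>` relationships, but the validation step above is the same idea.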
Results: Zero Downtime Across 6,000 Pipelines
The integration of Airflow and Slurm has delivered significant benefits for Meteosim:
- Scalability: They run nearly all workflows through Slurm, executing around 6,000 DAG runs daily.
- Reliability: With redundant infrastructure and deferrable operators, high availability is maintained across HPC clusters and virtual machines. During his session, Eloi reported the company had experienced zero downtime as a result of the integration.
- Ease of Use: The internal web tool simplifies pipeline management, reducing onboarding time and enhancing productivity.
- Continuous Improvement: Regular Airflow upgrades ensure Meteosim stays ahead with new features, enabling the company’s engineers to further improve workflow orchestration.
Next Steps
Meteosim’s innovative use of Airflow and Slurm demonstrates the power of combining data orchestration and HPC to tackle some of the most computationally intensive workloads. For a detailed walkthrough of their architecture and lessons learned, watch the replay, “Airflow and multi-cluster Slurm working together.”
From HPC to ML/AI to analytics and data-driven software, Airflow is an incredibly extensible tool, supporting almost any use case. The best way to get started is with Astro, the fully managed Airflow service from Astronomer. You can sign up here for your free Astro trial.