Apache Airflow® components
When working with Apache Airflow®, understanding the underlying infrastructure components and how they function helps you develop and run your DAGs, troubleshoot issues, and operate Airflow successfully.
In this guide, you'll learn about the core components of Airflow and how to manage Airflow infrastructure for high availability. Some of the components and features described in this topic are unavailable in earlier Airflow versions.
There are multiple resources for learning about this topic. See also:
- Astronomer Academy: Airflow: Basics module.
Assumed knowledge
To get the most out of this guide, you should have an understanding of:
- Basic Airflow concepts. See Introduction to Apache Airflow.
Core components
The following Apache Airflow core components run at all times:
- Webserver: A Flask server running with Gunicorn that serves the Airflow UI.
- Scheduler: A daemon responsible for scheduling jobs. This is a multi-threaded Python process that determines which tasks need to run, when they need to run, and where they run. Within the scheduler, the executor setting determines where and how tasks are run.
- Database: A database where all DAG and task metadata are stored. This is typically a Postgres database, but MySQL and SQLite are also supported (SQLite only for local development and testing).
If you run Airflow locally using the Astro CLI, you'll notice that when you start Airflow using astro dev start, it spins up three containers, one for each of the core components.
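Outside of the Astro CLI, the following is a minimal sketch of starting each core component as its own process with the open source Airflow CLI. It assumes a recent Airflow 2.x release; older releases use airflow db init instead of airflow db migrate, and the port shown is the illustrative default.

```bash
# Create or upgrade the metadata database schema
# (use `airflow db init` on Airflow releases older than 2.7)
airflow db migrate

# Start the webserver that serves the Airflow UI (default port 8080)
airflow webserver --port 8080

# Start the scheduler in a separate terminal or as its own service
airflow scheduler
```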
In addition to these core components, there are a few situational components that are used only to run tasks or make use of certain features:
- Worker: The process that executes tasks, as defined by the executor. Depending on which executor you choose, you may or may not have workers as part of your Airflow infrastructure.
- Triggerer: A separate process that supports deferrable operators. This component is optional and is only needed if you plan to use deferrable (or "asynchronous") operators; see the sketch after this list for an example.
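To see why the triggerer matters, the following is a minimal sketch of a DAG that uses a deferrable sensor. It assumes a recent Airflow 2.x release (TimeDeltaSensorAsync is available in Airflow 2.2+, and the schedule argument shown requires 2.4+), the DAG and task names are illustrative, and the deferred task only completes if an airflow triggerer process is running.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync  # deferrable sensor, Airflow 2.2+

with DAG(
    dag_id="deferrable_example",       # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # `schedule` replaced `schedule_interval` in Airflow 2.4
    catchup=False,
):
    # While this sensor waits, the task defers to the triggerer
    # instead of occupying a worker slot for the full hour.
    wait_one_hour = TimeDeltaSensorAsync(
        task_id="wait_one_hour",
        delta=timedelta(hours=1),
    )
```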
The following diagram illustrates component interaction:
Executors
You can use pre-configured Airflow executors, or you can create a custom executor. There are several pre-configured executors in Airflow for local and production use cases. In production, Astronomer recommends using either of the following executors:
- CeleryExecutor: Uses a Celery backend (such as Redis, RabbitMQ, Redis Sentinel, or another message queue system) to coordinate tasks between pre-configured workers. This executor is ideal for high volumes of shorter-running tasks or environments with consistent task loads. The CeleryExecutor is available as part of the Celery provider.
- KubernetesExecutor: Calls the Kubernetes API to create a separate Kubernetes pod for each task to run, enabling users to pass in custom configurations for each of their tasks and use resources efficiently. The KubernetesExecutor is available as part of the CNCF Kubernetes provider. This executor is ideal in the following scenarios:
- You have long running tasks that you don't want to be interrupted by code deploys or Airflow updates.
- Your tasks require very specific resource configurations.
- Your tasks run infrequently, and you don't want to incur worker resource costs when they aren't running.
For local development, Astronomer recommends using the LocalExecutor. It executes tasks locally inside the scheduler process and does not require workers. It supports parallelism and hyperthreading.
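As a reference point, the executor is selected through Airflow's core executor setting, which you can define in airflow.cfg or as an environment variable. The following sketch shows the standard executor names; managed platforms such as Astro configure this for you.

```bash
# Select the executor via environment variable (takes precedence over airflow.cfg)
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor

# Equivalent airflow.cfg setting:
# [core]
# executor = CeleryExecutor

# Other common values: LocalExecutor, KubernetesExecutor, CeleryKubernetesExecutor
```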
For more information, see Apache Airflow® Executors. Astro customers can choose and configure their executors when creating a deployment. See Manage Airflow executors on Astro for more information.
Managing Airflow infrastructure
All Airflow components should be run on an infrastructure that is appropriate for the requirements of your organization. For example, using the Astro CLI to run Airflow on a local computer can be helpful when testing and for DAG development, but it is insufficient to support running DAGs in production.
The following resources can help you manage Airflow components:
- OSS Production Docker Images
- OSS Official Helm Chart
- Managed Airflow on Astro
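For example, deploying the OSS Official Helm Chart into an existing Kubernetes cluster looks roughly like the following sketch. The release name and namespace (both airflow here) are placeholders, and production deployments typically supply a custom values file for the executor, database, and DAG deployment mechanism.

```bash
# Add the official Apache Airflow Helm repository
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Install (or upgrade) Airflow into its own namespace
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow \
  --create-namespace
```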
Scalability is also an important consideration when setting up your production Airflow environment. See Scaling out Airflow.
High availability
Airflow can be made highly available, which makes it suitable for large organizations with critical production workloads. Airflow 2 introduced a highly available scheduler, meaning that you can run multiple scheduler replicas in an active-active model. This makes the scheduler more performant and resilient, eliminating a single point of failure in your Airflow environment.
Running multiple schedulers requires a metadata database that supports row-level locking with SELECT ... FOR UPDATE SKIP LOCKED, such as recent versions of PostgreSQL or MySQL 8+. See Running More Than One Scheduler.
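As a minimal sketch, an additional scheduler is simply another scheduler process pointed at the same metadata database; Airflow coordinates the replicas through row-level locks. The connection string below is a placeholder, and in Airflow versions before 2.3 the key is AIRFLOW__CORE__SQL_ALCHEMY_CONN. With the official Helm chart, the same result is typically achieved by increasing the scheduler replica count.

```bash
# On each scheduler machine or container, point at the same metadata database
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://user:pass@db-host:5432/airflow"

# Then start another scheduler process
airflow scheduler
```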