Skip to main content

An introduction to Apache Airflow®

Apache Airflow® is an open source tool for programmatically authoring, scheduling, and monitoring data pipelines. Every month, millions of new and returning users download Airflow and it has a large, active open source community. The core principle of Airflow is to define data pipelines as code, allowing for dynamic and scalable workflows.

This guide offers an introduction to Apache Airflow and its core concepts. You'll learn about:

  • Why you should use Airflow.
  • Common use cases for Airflow.
  • How to run Airflow.
  • Important Airflow concepts.
  • Where to find resources to learn more about Airflow.
Other ways to learn

There are multiple resources for learning about this topic. See also:

Assumed knowledge

To get the most out of this guide, you should have an understanding of:

Why use Airflow

Apache Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. It is especially useful for creating and orchestrating complex data pipelines.

Data orchestration sits at the heart of any modern data stack and provides elaborate automation of data pipelines. With orchestration, actions in your data pipeline become aware of each other and your data team has a central location to monitor, edit, and troubleshoot their workflows.

One of the core tenets of Airflow that sets it apart from other data orchestration tools is defining data pipelines as Python code. Modern data teams seek to define their workflows in code to:

  • Manage workflows with their existing change-control and version-control tooling.
  • Bulk-create/update workflows too numerous, large, or dynamic to mange from a GUI.
  • Define re-usable paramaterizable workflow sections.

Airflow provides many additional benefits, including:

  • Tool agnosticism: Airflow can connect to any application in your data ecosystem that allows connections through an API. Prebuilt operators exist to connect to many common data tools.
  • High extensibility: Since Airflow pipelines are written in Python, you can build on top of the existing codebase and extend the functionality of Airflow to meet your needs. Anything you can do in Python, you can do in Airflow.
  • Infinite scalability: Given enough computing power, you can orchestrate as many processes as you need, no matter the complexity of your pipelines.
  • Dynamic data pipelines: Airflow offers the ability to create dynamic tasks to adjust your workflows based on the data you are processing at runtime.
  • Vibrant OSS community: With millions of users and thousands of contributors, Airflow is here to stay and grow. Join the Airflow Slack to become part of the community.
  • Observability: The Airflow UI provides an immediate overview of all your data pipelines and can provide the source of truth for workflows in your whole data ecosystem.

Screenshot of the main view of the Airflow UI with seven enabled workflows.

Airflow use cases

Many data professionals at companies of all sizes and types use Airflow. Data engineers, data scientists, ML engineers, and data analysts all need to perform actions on data in a complex web of dependencies. With Airflow, you can orchestrate these actions and dependencies in a single platform, no matter which tools you are using and how complex your pipelines are.

Symbolic graph with Airflow shown as the center of the data ecosystem, with arrows pointing out from Airflow to logos of a variety of common data tools.

Some common use cases of Airflow include:

Of course, these are just a few examples, you can orchestrate almost any kind of batch workflows with Airflow.

Running Airflow

There are many ways to run Airflow, some of which are easier than others. Astronomer recommends:

  • Using the open-source Astro CLI to run Airflow locally. The Astro CLI is the easiest way to create a local Airflow instance running in containers and is free to use for everyone.
  • Using Astro to run Airflow in production. Astro is a fully-managed SaaS application for data orchestration that helps teams write and run data pipelines with Apache Airflow at any level of scale. A free trial is available.

All Airflow installations include the mandatory Airflow components as part of their infrastructure: the webserver, scheduler, database, and executor. See Airflow components for more information.

Airflow concepts

To navigate Airflow resources, it is helpful to have a general understanding of the following Airflow concepts.

Pipeline basics

  • DAG: Directed Acyclic Graph. An Airflow DAG is a workflow defined as a graph, where all dependencies between nodes are directed and nodes do not self-reference, meaning there are no circular dependencies. For more information on Airflow DAGs, see Introduction to Airflow DAGs.
  • DAG run: The execution of a DAG at a specific point in time. A DAG run can be one of four different types: scheduled, manual, dataset_triggered or backfill.
  • Task: A step in a DAG describing a single unit of work.
  • Task instance: The execution of a task at a specific point in time.
  • Dynamic task: An Airflow task that serves as a blueprint for a variable number of dynamically mapped tasks created at runtime. For more information, see Dynamic tasks.

The following screenshot shows one simple DAG, called example_astronauts, with two tasks, get_astronauts and print_astronaut_craft. The get_astronauts task is a regular task, while the print_astronaut_craft task is a dynamic task. The grid view of the Airflow UI shows individual DAG runs and task instances, while the graph displays the structure of the DAG. You can learn more about the Airflow UI in An introduction to the Airflow UI.

Screenshot of the Airflow UI Grid view with the Graph tab selected showing a DAG graph with a regular and a dynamic task as well as a DAG run and Task instance.

Click to view the full DAG code used for the screenshot
"""
## Astronaut ETL example DAG

This DAG queries the list of astronauts currently in space from the
Open Notify API and prints each astronaut's name and flying craft.
There are two tasks, one to get the data from the API and save the results,
and another to print the results. Both tasks are written in Python using
Airflow's TaskFlow API, which allows you to easily turn Python functions into
Airflow tasks, and automatically infer dependencies and pass data.

The second task uses dynamic task mapping to create a copy of the task for
each Astronaut in the list retrieved from the API. This list will change
depending on how many Astronauts are in space, and the DAG will adjust
accordingly each time it runs.

For more explanation and getting started instructions, see our Write your
first DAG tutorial: https://www.astronomer.io/docs/learn/get-started-with-airflow

![Picture of the ISS](https://www.esa.int/var/esa/storage/images/esa_multimedia/images/2010/02/space_station_over_earth/10293696-3-eng-GB/Space_Station_over_Earth_card_full.jpg)
"""

from airflow import Dataset
from airflow.decorators import dag, task
from pendulum import datetime
import requests


# Define the basic parameters of the DAG, like schedule and start_date
@dag(
start_date=datetime(2024, 1, 1),
schedule="@daily",
catchup=False,
doc_md=__doc__,
default_args={"owner": "Astro", "retries": 3},
tags=["example"],
)
def example_astronauts():
# Define tasks
@task(
outlets=[Dataset("current_astronauts")]
) # Define that this task updates the `current_astronauts` Dataset
def get_astronauts(**context) -> list[dict]:
"""
This task uses the requests library to retrieve a list of Astronauts
currently in space. The results are pushed to XCom with a specific key
so they can be used in a downstream pipeline. The task returns a list
of Astronauts to be used in the next task.
"""
r = requests.get("http://api.open-notify.org/astros.json")
number_of_people_in_space = r.json()["number"]
list_of_people_in_space = r.json()["people"]

context["ti"].xcom_push(
key="number_of_people_in_space", value=number_of_people_in_space
)
return list_of_people_in_space

@task
def print_astronaut_craft(greeting: str, person_in_space: dict) -> None:
"""
This task creates a print statement with the name of an
Astronaut in space and the craft they are flying on from
the API request results of the previous task, along with a
greeting which is hard-coded in this example.
"""
craft = person_in_space["craft"]
name = person_in_space["name"]

print(f"{name} is currently in space flying on the {craft}! {greeting}")

# Use dynamic task mapping to run the print_astronaut_craft task for each
# Astronaut in space
print_astronaut_craft.partial(greeting="Hello! :)").expand(
person_in_space=get_astronauts() # Define dependencies using TaskFlow API syntax
)


# Instantiate the DAG
example_astronauts()

It is a core best practice to keep tasks as atomic as possible, meaning that each task performs a single action. Additionally, tasks are idempotent, which means they produce the same output every time they are run with the same input. See DAG writing best practices in Apache Airflow.

Writing pipelines

Airflow tasks are defined in Python code. You can define tasks using:

  • Decorators (@task): The TaskFlow API allows you to define tasks by using a set of decorators that wrap Python functions. This is the easiest way to create tasks from existing Python scripts. Each call to a decorated function becomes one task in your DAG.
  • Operators (XYZOperator): Operators are classes abstracting over Python code designed to perform a specific action. You can instantiate an operator by providing the necessary parameters to the class. Each instantiated operator becomes one task in your DAG.

There are a couple of special types of operators that are worth mentioning:

  • Sensors: Sensors are Operators that keep running until a certain condition is fulfilled. For example, the HttpSensor waits for an Http request object to fulfill a user defined set of criteria.
  • Deferrable Operators: Deferrable Operators use the Python asyncio library to run tasks asynchronously. For example, the DateTimeSensorAsync waits asynchronously for a specific date and time to occur. Note that your Airflow environment needs to run a triggerer component to use deferrable operators.

Some commonly used building blocks, like the BashOperator, the @task decorator, or the PythonOperator, are part of core Airflow and automatically installed in all Airflow instances. Additionally, many operators are maintained separately to Airflow in Airflow provider packages, which group modules interacting with a specific service into a package.

You can browse all available operators and find detailed information about their parameters in the Astronomer Registry. For many common data tools, there are integration tutorials available, showing a simple implementation of the provider package.

Additional concepts

While there is much more to Airflow than just DAGs and tasks, here are a few additional concepts and features that you are likely to encounter:

  • Airflow scheduling: Airflow offers a variety of ways to schedule your DAGs. For more information, see DAG scheduling and timetables in Airflow.
  • Airflow connections: Airflow connections offer a way to store credentials and other connection information for external systems and reference them in your DAGs. For more information, see Manage connections in Apache Airflow.
  • Airflow variables: Airflow variables are key-value pairs that can be used to store information in your Airflow environment. For more information, see Use Airflow variables.
  • XComs: XCom is short for cross-communication, you can use XCom to pass information between your Airflow tasks. For more information, see Passing data between tasks.
  • Airflow REST API: The Airflow REST API allows Airflow to interact with RESTful web services.

Resources

Next steps

Now that you have a basic understanding of Apache Airflow, you are ready to write your first DAG by following the Get started with Apache Airflow tutorial.

Was this page helpful?