Apache Airflow® Quickstart - Learn Airflow
Learning Airflow: An introduction to Airflow's lean and dynamic pipelines-as-Python-code.
Step 1: Clone the Astronomer Quickstart repository
- Create a new directory for your project and open it:

  ```bash
  mkdir airflow-quickstart-learning && cd airflow-quickstart-learning
  ```

- Clone the repository and open it:

  ```bash
  git clone -b learning-airflow --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/learning-airflow
  ```
Your directory should have the following structure:

```text
.
├── Dockerfile
├── README.md
├── dags
│   ├── example_astronauts.py
│   └── example_extract_astronauts.py
├── include
├── packages.txt
├── requirements.txt
├── solutions
│   └── example_astronauts_solution.py
└── tests
    └── dags
        └── test_dag_integrity.py
```
Step 2: Start up Airflow and explore the UI
- Start the project using the Astro CLI:

  ```bash
  astro dev start
  ```

  The CLI will let you know when all Airflow services are up and running.

- If it doesn't launch automatically, navigate your browser to `localhost:8080` and sign in to the Airflow UI using the username `admin` and the password `admin`.

- Explore the DAGs view (the landing page) and the individual DAG view page to get a sense of the metadata available about the DAG, its runs, and all task instances. For a deep dive into the UI's features, see An introduction to the Airflow UI.
For example, the DAGs view will look like this screenshot:
As you start to trigger DAG runs, the graph view will look like this screenshot:
The Gantt chart will look like this screenshot:
Step 3: Explore the project
This Astro project introduces you to the basics of orchestrating pipelines with Airflow. You'll see how easy it is to:
- Get data from data sources.
- Generate tasks automatically and in parallel.
- Trigger downstream workflows automatically.
You'll build a lean, dynamic pipeline serving a common use case: extracting data from an API and loading it into a database!
This project uses DuckDB, an in-memory database. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!
For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
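If you want to see the difference for yourself, here is a minimal sketch using the `duckdb` Python package; the file name `astronauts.db` is just an illustrative placeholder, not a file the project creates:

```python
import duckdb

# In-memory database: everything vanishes when the connection/process ends.
memory_conn = duckdb.connect(":memory:")

# File-backed database: data is written to disk and survives restarts.
# "astronauts.db" is a hypothetical file name used only for illustration.
file_conn = duckdb.connect("astronauts.db")
file_conn.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER, name VARCHAR)")
file_conn.execute("INSERT INTO demo VALUES (1, 'example')")
print(file_conn.execute("SELECT * FROM demo").fetchall())
```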
Pipeline structure
An Airflow instance can run any number of DAGs (directed acyclic graphs), which are your data pipelines in Airflow. This project has two:
`example_astronauts`
This DAG queries the list of astronauts currently in space from the Open Notify API, prints assorted data about the astronauts, and loads data into an in-memory database.
Tasks in the DAG are Python functions decorated using Airflow's TaskFlow API, which makes it easy to turn arbitrary Python code into Airflow tasks, automatically infer dependencies, and pass data between tasks.
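As a rough illustration of the pattern (not the project's actual code), a TaskFlow-style DAG might look like the sketch below; the DAG and task names are made up for this example:

```python
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def taskflow_sketch():
    @task
    def extract() -> list[str]:
        # Any Python function becomes an Airflow task via the @task decorator.
        return ["Oleg Kononenko", "Nikolai Chub"]

    @task
    def report(names: list[str]) -> None:
        # Passing extract()'s return value here makes Airflow infer the
        # extract >> report dependency and pass the data between the tasks.
        print(f"{len(names)} astronauts are currently in space.")

    report(extract())


taskflow_sketch()
```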
- `get_astronaut_names` and `get_astronaut_numbers` make a JSON array and an integer available, respectively, to downstream tasks in the DAG.

- `print_astronaut_craft` and `print_astronauts` make use of this data in different ways. The third task, `print_astronaut_craft`, uses dynamic task mapping to create a parallel task for each astronaut in the list retrieved from the API. Airflow lets you do this with just two lines of code:

  ```python
  print_astronaut_craft.partial(greeting="Hello! :)").expand(
      person_in_space=get_astronaut_names()
  ),
  ```

  The key feature is the `expand()` function, which makes the DAG automatically adjust the number of tasks each time it runs (see the sketch after this list).

- `create_astronauts_table_in_duckdb` and `load_astronauts_in_duckdb` create a DuckDB database table for some of the data and load the data, respectively.
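Putting the pieces together, here is a simplified, hypothetical sketch of the dynamic task mapping pattern; it mirrors the structure of `example_astronauts` but is not the project's exact code:

```python
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def astronauts_sketch():
    @task
    def get_astronaut_names() -> list[str]:
        # In the real DAG, this list comes from the Open Notify API.
        return ["Oleg Kononenko", "Nikolai Chub", "Tracy Caldwell Dyson"]

    @task
    def print_astronaut_craft(greeting: str, person_in_space: str) -> None:
        print(f"{greeting} {person_in_space} is currently in space.")

    # expand() creates one mapped task instance per element of the upstream list.
    print_astronaut_craft.partial(greeting="Hello! :)").expand(
        person_in_space=get_astronaut_names()
    )


astronauts_sketch()
```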
`example_extract_astronauts`

This DAG queries the database you created for astronaut data in `example_astronauts` and prints out some of this data. Changing a single line of code in this DAG can make it run automatically when the other DAG completes a run.
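A hedged sketch of what such a query task could look like; the database path and table name below are placeholders, not necessarily what the project uses:

```python
import duckdb
from airflow.decorators import task


@task
def print_astronaut_count() -> None:
    # Hypothetical file path and table name, for illustration only.
    conn = duckdb.connect("include/astronauts.db")
    count = conn.execute("SELECT COUNT(*) FROM astronauts").fetchone()[0]
    print(f"The astronauts table currently holds {count} rows.")
```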
Step 4: Get your hands dirty!
With Airflow, it's easy to create cross-workflow dependencies. In this step, you'll learn how to:
- Use Airflow Datasets to create a dependency between DAGs so that when one workflow ends, another begins. To do this, you'll modify the `example_extract_astronauts` DAG to use a Dataset to trigger a DAG run when the `example_astronauts` DAG updates the table that both DAGs query.
Schedule the `example_extract_astronauts` DAG on an Airflow Dataset
With Datasets, DAGs that access the same data can have explicit, visible relationships, and DAGs can be scheduled based on updates to these datasets. This feature helps make Airflow data-aware and expands Airflow scheduling capabilities beyond time-based methods such as cron. Downstream DAGs can be scheduled based on combinations of Dataset updates coming from tasks in the same Airflow instance or calls to the Airflow API.
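For example, a downstream DAG that should wait on more than one Dataset can take a list as its schedule; with a list, the DAG runs once every listed Dataset has been updated since its last run. This is a hedged sketch with placeholder Dataset names, not part of the project:

```python
from airflow import Dataset
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    start_date=datetime(2024, 1, 1),
    # Placeholder Dataset URIs: the DAG runs only after both have been updated.
    schedule=[Dataset("current_astronauts"), Dataset("current_spacecraft")],
    catchup=False,
)
def combined_dataset_consumer():
    @task
    def summarize() -> None:
        print("Both upstream Datasets were updated; running downstream work.")

    summarize()


combined_dataset_consumer()
```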
- Define the `get_astronaut_names` task as a producer of a Dataset. To do this, pass a Dataset object, encapsulated in a list, to the task's `outlets` parameter by altering the first `@task` in the DAG code:

  ```python
  @task(
      outlets=[Dataset("current_astronauts")]
  )
  def get_astronaut_names(**context) -> list[dict]:
  ```

  For more information about Airflow Datasets, see Datasets and data-aware scheduling in Airflow.
- Schedule a downstream DAG run using an Airflow Dataset:

  Now that you have defined the `get_astronaut_names` task in the `example_astronauts` DAG as a Dataset producer, you can use that Dataset to schedule downstream DAG runs.

  Datasets function like an API to communicate when data at a specific location in your ecosystem is ready for use, reducing the code required to create cross-DAG dependencies. For example, with an import and a single line of code, you can schedule a DAG to run when another DAG in the same Airflow environment has updated a Dataset.

  To schedule the `example_extract_astronauts` DAG to run when `example_astronauts` updates the `current_astronauts` Dataset, add an import statement to make the Airflow Dataset package available:

  ```python
  from airflow import Dataset
  ```
- Then, set the DAG's schedule using the `current_astronauts` Dataset (a sketch of how this looks in the DAG definition follows these steps):

  ```python
  schedule=[Dataset("current_astronauts")],
  ```
- Rerun the `example_astronauts` DAG in the UI and check the status of the tasks in the individual DAG view. Watch as the `example_extract_astronauts` DAG gets triggered automatically when `example_astronauts` finishes running.

  If all goes well, the graph view of the Dataset-triggered DAG run will look like this screenshot:
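In context, the new schedule sits in the DAG definition. A hedged sketch of how the `example_extract_astronauts` decorator might look with the Dataset schedule (the surrounding parameters are illustrative, not the project's exact code):

```python
from airflow import Dataset
from airflow.decorators import dag
from pendulum import datetime


@dag(
    start_date=datetime(2024, 1, 1),
    # Run whenever the current_astronauts Dataset is updated, instead of on
    # a time-based (cron) schedule.
    schedule=[Dataset("current_astronauts")],
    catchup=False,
)
def example_extract_astronauts():
    ...  # the tasks that query and print the astronaut data go here


example_extract_astronauts()
```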
For more information about Airflow Datasets, see: Datasets and data-aware scheduling in Airflow.
Next Steps: Run Airflow on Astro
The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.