Note: This webinar was recorded in June 2021. Since then, Airflow 2.3 has been released, adding dynamic task mapping, which has changed and improved many of the patterns shown in this webinar. For the latest best practices around dynamic tasks, check out our newer Dynamic Tasks in Airflow webinar, the Astronomer Academy module Airflow: Dynamic Task Mapping, and our Create dynamic Airflow tasks guide. For information on how to dynamically generate DAGs, see the Dynamically generate DAGs in Airflow guide and the Airflow: Dynamic DAGs academy module.
The simplest way of creating an Airflow DAG is to write it as a static Python file. However, sometimes manually writing DAGs isn’t practical.
Maybe you have hundreds or thousands of DAGs that do similar things, with just a parameter changing between them. Or maybe you need a set of DAGs to load tables, but don’t want to manually update DAGs every time those tables change.
In these cases, and others, it can make more sense to dynamically generate DAGs. Because everything in Airflow is code, you can dynamically generate DAGs using Python alone.
In this webinar, we’ll talk about when you might want to dynamically generate your DAGs, show a couple of methods for doing so, and discuss problems that can arise when implementing dynamic generation at scale.
In this webinar we cover:
- How Airflow identifies a DAG
- Use cases for dynamically generating DAGs
- Commonly used methods for dynamic generation
- Pitfalls and common issues with dynamic generation
Generating DAGs - The Static Way
Most people who have used Airflow are familiar with defining DAGs statically.
You create a Python file, instantiate your DAG, and define your tasks.
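For reference, a minimal static DAG might look like the sketch below (the dag_id, schedule, and task are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("hello")


# One file, one statically defined DAG
with DAG(
    dag_id="example_static_dag",  # hypothetical name
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello)
```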
But What Actually Makes a DAG?
- Airflow executes all Python code in the DAG_FOLDER and loads any DAG object found in globals()
- This means that any Python code that generates a DAG object can be used to create DAGs
A DAG is dynamically generated when each parsing of the DAG file can produce different results.
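As a minimal sketch, the loop below creates three DAG objects in a single file and registers each one in globals() under a unique name (the environment names and schedule are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Each pass through the loop builds a distinct DAG object
for env in ["dev", "staging", "prod"]:
    dag_id = f"example_{env}_pipeline"

    with DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        DummyOperator(task_id="start")

    # Assigning the DAG into globals() is what makes Airflow pick it up
    globals()[dag_id] = dag
```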
Why is this useful?
Dynamically generating DAGs can be helpful when you have DAGs that follow a similar pattern, and:
- Want to automate migration from a legacy system to Airflow
- Have only a parameter changing between DAGs
- Have DAGs that are dependent on the changing structure of a source system
- Want to institute standards within DAGs across your team or organization
Ways to Dynamically Generate DAGs: Single File
Create a Python script that lives in your DAG_FOLDER and generates DAG objects.
You might have a function that creates a DAG based on some parameters, and a loop that calls that function for each input (a minimal sketch follows the list below).
Those parameters may come from:
- Within the file
- An Airflow variable
- Airflow connections
- Etc.
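Here is a minimal sketch of the single-file method, pulling its parameters from a hypothetical Airflow variable named dag_configs that holds a JSON list:

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator


def create_dag(dag_id, schedule, command):
    """Build one DAG from a set of parameters."""
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 6, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        BashOperator(task_id="run", bash_command=command)
    return dag


# Hypothetical variable value, e.g.:
# [{"dag_id": "load_table_a", "schedule": "@daily", "command": "echo a"}]
# Note: Variable.get at the top level runs on every parse, so keep it cheap.
configs = json.loads(Variable.get("dag_configs", default_var="[]"))

for config in configs:
    globals()[config["dag_id"]] = create_dag(
        config["dag_id"], config["schedule"], config["command"]
    )
```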
Ways to Dynamically Generate DAGs: Multiple Files
Create a script (in Python or another language) that generates complete DAG .py files, which are then loaded into your Airflow environment.
This is most straightforward when you are parameterizing the same DAG structure and want to automatically read those parameters from YAML, JSON, etc.
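As a sketch of this approach, the standalone script below (run manually or in CI/CD, not from the DAG_FOLDER) reads hypothetical JSON config files and writes one .py file per config from a string template; all filenames and paths are assumptions:

```python
# generate_dags.py -- run outside Airflow, e.g., in a CI/CD pipeline
import json
from pathlib import Path

TEMPLATE = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="{dag_id}",
    start_date=datetime(2021, 6, 1),
    schedule_interval="{schedule}",
    catchup=False,
) as dag:
    BashOperator(task_id="run", bash_command="{command}")
'''

config_dir = Path("dag_configs")  # one JSON file per DAG (hypothetical layout)
output_dir = Path("dags")         # your Airflow DAG_FOLDER

for config_file in config_dir.glob("*.json"):
    config = json.loads(config_file.read_text())
    dag_file = output_dir / f"{config['dag_id']}.py"
    dag_file.write_text(TEMPLATE.format(**config))
    print(f"Wrote {dag_file}")
```

Because each generated file is a plain static DAG, the scheduler does no extra work at parse time; the cost is an extra generation step whenever the configs change.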
Pros and Cons
Scalability
Any code in the DAG_FOLDER is executed on every scheduler heartbeat. Methods where that code dynamically generates DAGs at parse time, such as the single-file method, are more likely to cause performance issues at scale.
If DAG parsing time exceeds the scheduler heartbeat interval, the scheduler can get locked up and tasks won't be executed.
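One way to sanity-check parsing time is to time a DagBag load yourself; a rough sketch, assuming it runs inside your Airflow environment:

```python
import time

from airflow.models import DagBag

start = time.perf_counter()
dag_bag = DagBag()  # parses every file in the configured DAG_FOLDER
elapsed = time.perf_counter() - start

print(f"Parsed {len(dag_bag.dags)} DAGs in {elapsed:.2f}s")
if dag_bag.import_errors:
    print("Import errors:", dag_bag.import_errors)
```

Airflow 2 also ships an airflow dags report CLI command that reports per-file parse times.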
Community Tools
A notable community tool for dynamically creating DAGs is dag-factory, an open source Python library for dynamically generating Airflow DAGs from YAML files.
https://github.com/ajbosco/dag-factory
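Per the dag-factory README, usage looks roughly like this (the YAML config path is hypothetical):

```python
from airflow import DAG  # noqa: F401  -- keeps "airflow" and "DAG" in the file so the processor parses it
import dagfactory

# Point dag-factory at a YAML file describing your DAGs
dag_factory = dagfactory.DagFactory("/usr/local/airflow/dags/config_file.yml")

dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())  # registers the generated DAGs in globals()
```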
Code Examples
This repo contains an Astronomer project with multiple examples showing how to dynamically generate DAGs in Airflow. https://github.com/astronomer/dynamic-dags-tutorial