Best Practices for Writing DAGs in Airflow 2

Hosted By

  • Kenten Danas
  • Viraj Parekh

Note: This webinar was recorded in November 2021. Airflow evolves rapidly, and several new features and best practices have been added since then. We recommend you also check out our up-to-date guide to DAG writing best practices in Apache Airflow®.

Agenda:

  1. What is Apache Airflow®?
  2. Apache Airflow® core principles
  3. The core concept: DAGs
  4. 6+ more best practices

1. What is Apache Airflow®?

Apache Airflow® is a way to programmatically author, schedule, and monitor your data pipelines.

Apache Airflow® was created by Maxime Beauchemin at Airbnb in late 2014 as an open-source project. It was brought into the Apache Software Foundation’s Incubator Program in March 2016 and saw growing success afterward. In January 2019, the Foundation announced Airflow as a Top-Level Apache Project, and it is now considered the industry’s leading workflow orchestration solution.

2. Apache Airflow® core principles

Airflow is built on a set of core ideals that allow you to leverage the most popular open-source workflow orchestrator on the market while maintaining enterprise-ready flexibility and reliability. In Airflow, your pipelines are written as code, so you have the full flexibility of Python behind you. Airflow was also designed with scalability and extensibility in mind.

[Figure: dag-writing-image5]

3. Core concept: DAGs

DAG stands for Directed Acyclic Graph. It is your data pipeline in Airflow! The main rules: your DAGs flow in one direction and have no loops.

[Figure: dag-writing-image4]

Don’t have infinite loops in your code!
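The two rules can be illustrated without Airflow at all. The following is a minimal sketch using Python’s standard-library graphlib module (Python 3.9+); the task names are made up for illustration, and this is not how Airflow itself validates DAGs.

```python
from graphlib import CycleError, TopologicalSorter

# Map each (hypothetical) task to the set of tasks it depends on.
pipeline = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid DAG has an execution order in which every task runs
# after all of its upstream tasks.
order = list(TopologicalSorter(pipeline).static_order())
print(order)


def has_cycle(graph):
    """Return True if the dependency graph contains a loop."""
    try:
        # static_order() is lazy, so list() forces the cycle check.
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


# Making "extract" depend on "report" closes a loop, so no valid
# execution order exists any more.
looped = dict(pipeline, extract={"report"})
print(has_cycle(pipeline), has_cycle(looped))
```

With a loop in the graph there is no task that can safely run first, which is why a pipeline has to be acyclic.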

4. Best practices

1. Idempotency

Idempotency is the property whereby an operation can be applied multiple times without changing the result.

This isn’t actually specific to Airflow, but rather applies to all data pipelines. Idempotent DAGs help you recover faster if something breaks and prevent data loss down the road.
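As a rough illustration of the idea, the sketch below compares a load step that appends on every run with one that replaces the rows for its run date. It uses an in-memory SQLite database; the table, columns, and function names are hypothetical and do not come from the webinar.

```python
import sqlite3


def load_append(conn, run_date, rows):
    """Not idempotent: every rerun inserts the same rows again."""
    conn.executemany(
        "INSERT INTO cases (run_date, state, n) VALUES (?, ?, ?)",
        [(run_date, state, n) for state, n in rows],
    )


def load_idempotent(conn, run_date, rows):
    """Idempotent: replace everything stored for run_date in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM cases WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO cases (run_date, state, n) VALUES (?, ?, ?)",
            [(run_date, state, n) for state, n in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (run_date TEXT, state TEXT, n INTEGER)")
rows = [("CA", 10), ("NY", 7)]

# Rerunning the idempotent load three times leaves exactly one copy.
for _ in range(3):
    load_idempotent(conn, "2021-11-01", rows)
idempotent_count = conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]

# Rerunning the append-only load three times triples the data.
conn.execute("DELETE FROM cases")
for _ in range(3):
    load_append(conn, "2021-11-01", rows)
append_count = conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]

print(idempotent_count, append_count)
```

With the idempotent version, retrying a failed run or re-running an old date leaves the table in the same state as a single successful run.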

2. Use Airflow as an Orchestrator

Airflow was designed to be an orchestrator, not an execution framework.

In practice, this means:

[Figure: dag-writing-image3]

Airflow was designed to play with all these other tools!

Using Provider Packages allows you to orchestrate services with Airflow with very little code. That’s one of the biggest benefits of Airflow.

Code example: two versions of a DAG, one that does not follow the best practice of using Airflow as an orchestrator and one that does.

3. DAG design: use Template Fields, Variables, and Macros

Making fields templatable, or using built-in Airflow variables and macros, allows values to be set dynamically at runtime with Jinja templating, for example from environment variables.

This helps with:

A great benefit of Airflow is that many commonly used variables are already built in.

Here is an example of variables straight out of the Astronomer Registry. You can reference them directly in your code, with no real extra lift for you:

[Figure: dag-writing-image1]

4. DAG design: keep your DAG files clean

Focus on readability and performance when creating your DAG files:

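In general, remember that all DAG code is re-parsed every min_file_process_interval (a scheduler setting, 30 seconds by default), so anything at the top level of a DAG file runs on every parse.

[Figure: dag-writing-image6]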
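The cost of top-level code can be shown with a plain-Python simulation. This sketch does not use Airflow: the parse function is a stand-in that executes a DAG file’s source, and expensive_lookup is a hypothetical slow call such as a database query.

```python
calls = {"expensive_lookup": 0}


def expensive_lookup():
    """Hypothetical slow call, e.g. a database query or an API request."""
    calls["expensive_lookup"] += 1
    return ["CA", "NY", "TX"]


# Version 1: the lookup sits at the top level of the "DAG file".
TOP_LEVEL_VERSION = """
states = expensive_lookup()

def run_task():
    return len(states)
"""

# Version 2: the lookup is moved inside the task callable.
IN_TASK_VERSION = """
def run_task():
    states = expensive_lookup()
    return len(states)
"""


def parse(source):
    """Stand-in for a parse: execute the file's top-level code."""
    namespace = {"expensive_lookup": expensive_lookup}
    exec(source, namespace)
    return namespace


# Five parses of version 1 trigger five lookups before any task runs.
for _ in range(5):
    parse(TOP_LEVEL_VERSION)
top_level_calls = calls["expensive_lookup"]

# Five parses of version 2 trigger none; the lookup happens only
# when the task itself is executed.
calls["expensive_lookup"] = 0
for _ in range(5):
    namespace = parse(IN_TASK_VERSION)
calls_after_parsing = calls["expensive_lookup"]
namespace["run_task"]()
calls_after_running = calls["expensive_lookup"]

print(top_level_calls, calls_after_parsing, calls_after_running)
```

The same reasoning applies to a real DAG file: keeping slow calls inside task callables means they run once per task execution rather than once per parse.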

Code example: a DAG that runs a query for each state in a chosen set of states, pulling COVID case data for today’s date from a database. Once all of the queries have completed successfully, a final task sends the result.

5. Make Use of Airflow 2 Features

Airflow 2.0+ has many new features that help improve the DAG authoring experience.

[Figure: dag-writing-image2]

6. Other Best Practices

Code example: a great example of where task groups come in handy.

For more, check out the written DAG best practices guide, and watch the webinar for the Q&A!
