Intro: Getting Started with Airflow

Note: There is a newer version of this webinar available: Airflow 101: How to get started writing data pipelines with Apache Airflow®.

By Kenten Danas, Lead Developer Advocate at Astronomer

1. What is Apache Airflow®?

Apache Airflow® is one of the world’s most popular data orchestration tools — an open-source platform that lets you programmatically author, schedule, and monitor your data pipelines.

Apache Airflow® was created by Maxime Beauchemin in late 2014, and brought into the Apache Software Foundation’s Incubator Program two years later. In 2019, Airflow was announced as a Top-Level Apache Project, and it is now considered the industry’s leading workflow orchestration solution.

Key benefits of Airflow:

Apache Airflow® Core principles

Airflow is built on a set of core principles — and written in a highly flexible language, Python — that allow for enterprise-ready flexibility and reliability. It is highly secure and was designed with scalability and extensibility in mind.

2. Airflow core components

The infrastructure of every Airflow environment includes the same core components: a scheduler, which monitors DAGs and triggers tasks once their dependencies are met; a webserver, which serves the Airflow UI; a metadata database, which stores the state of all DAGs and task instances; and an executor, which determines how and where tasks are run.

3. Airflow core concepts

DAGs

A DAG (Directed Acyclic Graph) is the structure of a data pipeline. A DAG run might extract, transform, and load data, essentially making it a data pipeline.

DAGs must flow in one direction, which means the dependencies between tasks can never form a loop, or cycle.

Each task in a DAG is defined by an operator, and there are specific downstream or upstream dependencies set between tasks.
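
To make this concrete, here is a minimal sketch of a DAG file. The dag_id, schedule, and task names are illustrative, and the sketch assumes Airflow 2.4 or later (where the schedule argument is available):

```python
# A minimal, illustrative DAG: three placeholder tasks whose dependencies
# flow in one direction only (no cycles).
from pendulum import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_etl",             # hypothetical name, for illustration
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Dependencies only flow downstream: extract -> transform -> load
    extract >> transform >> load
```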

Tasks

A task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and then have upstream and downstream dependencies set between them to express the order in which they should run. Best practice: keep your tasks atomic by making sure they only do one thing.

A task instance is a specific run of a task for a given DAG run (and thus for a given data interval). Task instances also represent what stage of the lifecycle a given task is currently in. You will hear a lot about task instances (TIs) when working with Airflow.
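
As an illustration of atomic tasks, here is a small sketch using the TaskFlow API, where each @task-decorated function does exactly one thing and the function calls define the dependencies (the names and data are made up):

```python
from pendulum import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule="@daily", catchup=False)
def atomic_tasks_example():
    @task
    def extract():
        # One unit of work: fetch the raw data
        return [1, 2, 3]

    @task
    def load(values):
        # One unit of work: write the data somewhere
        print(f"loading {len(values)} records")

    # Passing extract()'s output to load() makes extract upstream of load
    load(extract())


atomic_tasks_example()
```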

Operators

Operators are the building blocks of Airflow. They determine what actually executes when your DAG runs. When you create an instance of an operator in a DAG and provide it with its required parameters, it becomes a task.
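
For example, inside a with DAG(...) block like the one sketched earlier, instantiating the BashOperator with its required bash_command parameter produces a task (the task_id and command here are illustrative):

```python
from airflow.operators.bash import BashOperator

say_hello = BashOperator(
    task_id="say_hello",                       # every task needs a unique task_id
    bash_command="echo 'hello from Airflow'",  # required parameter of this operator
)
```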

Providers

Airflow providers are Python packages that contain all of the relevant Airflow modules for interacting with external services. Airflow is designed to fit into any stack: you can also use it to run your workloads in AWS, Snowflake, Databricks, or whatever else your team uses.

Most tools already have community-built Airflow modules, giving Airflow spectacular flexibility. Check out the Astronomer Registry to find all available providers.
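
As a sketch, if the apache-airflow-providers-snowflake package is installed and a Snowflake connection has been configured (assumed here to be named snowflake_default), a provider-supplied operator can be used like any other:

```python
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

create_table = SnowflakeOperator(
    task_id="create_table",
    snowflake_conn_id="snowflake_default",            # assumed connection id
    sql="CREATE TABLE IF NOT EXISTS demo (id INT);",  # illustrative SQL
)
```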

The following diagram shows how these concepts work in practice. As you can see, by writing a single DAG file in Python using an existing provider package, you can begin to define complex relationships between data and actions.

[Diagram: a single DAG file bringing together DAGs, tasks, operators, and provider packages]

4. Best practices for beginners

  1. Design Idempotent DAGs
    DAG runs should produce the same result regardless of how many times they are run.
  2. Use Providers
    Don’t reinvent the wheel with the PythonOperator unless you need to. Use provider packages for specific tasks, and check the Astronomer Registry, which has everything there is to know about providers.
  3. Keep Tasks Atomic
    When designing your DAG, each task should do a single unit of work. Use dependencies and trigger rules to control the order in which tasks run.
  4. Keep Clean DAG Files
    Define one DAG per .py file. Keep any code that isn’t part of the DAG definition (e.g., SQL, Python scripts) in an /include directory.
  5. Use Connections
    Use Airflow’s Connections feature to keep sensitive information out of your DAG files.
  6. Use Template Fields
    Airflow’s variables and macros can be used to update DAGs at runtime and keep them idempotent (see the sketch after this list).
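
Here is a minimal sketch tying several of these practices together: atomic tasks, a Connection referenced only by its conn_id, and a templated {{ ds }} value so that rerunning a DAG run only affects that run’s data interval. The connection id, table names, and SQL are illustrative, and the sketch assumes the apache-airflow-providers-postgres package is installed:

```python
from pendulum import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="best_practices_example",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # {{ ds }} is templated to the run's logical date, so re-running a given
    # DAG run clears and reloads only that day's data (idempotent).
    clear_partition = PostgresOperator(
        task_id="clear_partition",
        postgres_conn_id="postgres_default",  # credentials live in the Connection
        sql="DELETE FROM sales WHERE sale_date = '{{ ds }}';",
    )

    load_partition = PostgresOperator(
        task_id="load_partition",
        postgres_conn_id="postgres_default",
        sql=(
            "INSERT INTO sales "
            "SELECT * FROM staging_sales WHERE sale_date = '{{ ds }}';"
        ),
    )

    clear_partition >> load_partition
```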

For more, check out this written guide on DAG best practices and watch the webinar, including the Q&A session.

5. Demo

Watch the Demo to learn:

You can find the code from the webinar in this GitHub repo.
