Introducing Apache Airflow 2.10

  • Kenten Danas

Every couple of months, the Apache Airflow project releases a new version with numerous features, improvements, and bug fixes that enhance functionality. The release of Airflow 2.10 adds flexibility to and expands some of the most widely used features. In all, it contains more than 40 great new features, over 80 improvements, and over 40 bug fixes.

While we are, of course, excited about every new feature and improvement, we’re particularly jazzed about the new dataset improvements. Datasets are one of the most popular Airflow features and are often a key part of implementing rapidly growing use cases like MLOps and GenAI. The updates in this release make the feature more flexible and easier to use, and we are sure the community will be quick to adopt them.

But 2.10 is definitely not only about datasets! This blog post will walk you through everything you need to know about this release, so you don’t miss out on any of the exciting additions and changes.

Dataset Enhancements

Datasets and data-aware scheduling were originally released in Airflow 2.4. They provide a way for DAGs that access the same data to have explicit, visible relationships and get scheduled based on updates to these datasets. Using datasets allows you to create smaller DAGs instead of large monolithic ones and allows different teams to maintain their own DAGs, even if data is shared across them, while gaining increased visibility into DAG dependencies.

In the 2023 Airflow survey, almost 50% of users indicated they have adopted dataset functionality. The previous release, Airflow 2.9, brought the biggest updates yet to this important feature, and 2.10 builds upon those improvements.

Dynamic Dataset Definition

In previous versions of Airflow, dataset inlets and outlets were required to be set during DAG parsing time; in other words, they were static. This design helped avoid poorly formed dataset URIs but did not allow the flexibility of setting inlets and outlets during task execution, which could be helpful in cases where you wanted to use datasets in combination with other features like dynamic task mapping.

To make this feature more flexible, Airflow 2.10 brings a new class, DatasetAlias, that can accept dataset values and is resolved at runtime. The alias allows you to define downstream schedules or inlets without knowing the exact name of the dynamic dataset ahead of time. To use a dataset alias, you set it as an outlet for your task and then associate dataset events with it at runtime through the outlet_events accessor. For example, you might have:

from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import task

@task(outlets=[DatasetAlias("my-task-outputs")])
def my_task(*, ds, outlet_events):
    outlet_events["my-task-outputs"].add(Dataset(f"s3://bucket/my-task/{ds}"))

In this case, the ds part of the dataset URI will be filled in at runtime based on the information passed to the task. Since you don’t know that information ahead of time, you can schedule a downstream DAG based on the alias:

DAG(..., schedule=DatasetAlias("my-task-outputs"))

This feature is very flexible and is designed to work with older implementations of datasets as well. Even if you use an alias, you can still schedule based on a dataset URI, and you can add multiple events to a single alias.
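Adding multiple events to one alias, for instance, is just a matter of calling add() more than once. Here is a minimal sketch reusing the alias from above; the object-store paths are purely illustrative:

from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import task

@task(outlets=[DatasetAlias("my-task-outputs")])
def my_task(*, ds, outlet_events):
    # Attach two dataset events to the same alias; a downstream DAG can be
    # scheduled on the alias itself or on either resolved dataset URI.
    outlet_events["my-task-outputs"].add(Dataset(f"s3://bucket/my-task/{ds}/part-1"))
    outlet_events["my-task-outputs"].add(Dataset(f"s3://bucket/my-task/{ds}/part-2"))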

Add Metadata to Dataset Events

One other benefit of the new dataset alias feature is that you can now attach metadata to an event using either the extra parameter or the Metadata class.

from airflow.datasets import Dataset, DatasetAlias
from airflow.datasets.metadata import Metadata
from airflow.decorators import task

@task(outlets=[DatasetAlias("my-task-outputs")])
def my_task(*, ds):
    s3_dataset = Dataset(f"s3://bucket/my-task/{ds}")
    yield Metadata(s3_dataset, extra={"k": "v"}, alias="my-task-outputs")

This allows you to save information about data that was processed, such as the number of records processed in that task, a new model accuracy score after training, or the filenames of any processed files. This metadata can also be used by tasks in downstream DAGs that interact with the same dataset.
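As a rough sketch of what that downstream access could look like, a consuming task can declare the alias as an inlet and read the extras from recent dataset events via the inlet_events accessor (the alias and values here are just the illustrative ones from above):

from airflow.datasets import DatasetAlias
from airflow.decorators import task

@task(inlets=[DatasetAlias("my-task-outputs")])
def consumer(*, inlet_events):
    # Look up the dataset events recorded against the alias and read the
    # extra metadata from the most recent one.
    events = inlet_events[DatasetAlias("my-task-outputs")]
    print(events[-1].extra)  # e.g. {"k": "v"} from the producing task above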

Dataset UI Updates

To support the new dataset alias feature, the datasets page has gotten a refresh to focus on dataset events. The new view has richer information about each dataset event, including the source, DAG runs that were triggered by that dataset, and extras.

A refreshed datasets page in Airflow 2.10 with a focus on dataset events.

The dependency graph and list of all datasets in that Airflow instance are now on separate tabs, making it cleaner and easier to navigate.

Apache Airflow 2.10 Datasets view showing the new Dependency Graph tab.

Dataset events are also now shown in the Details tab of each DAG run and in the DAG graph.

Apache Airflow Datasets Details tab showing Dataset Events and Task Instance notes.

Screenshot of Apache Airflow 2.10 showing successful task dataset_with_extra_by_context.

User Interface Improvements

Nearly every Airflow release brings great UI updates that improve the experience of working with Airflow, and Airflow 2.10 is particularly exciting in this regard. In addition to the dataset UI updates mentioned above, this release brings a highly requested and anticipated dark mode to Airflow.

By simply toggling the icon on the right side of the navigation bar, you can switch easily between light and dark mode.

Screenshot of Apache Airflow 2.10 showing the new dark/light theme toggle button.

In addition, 2.10 brings other convenient features to the UI, including a new button to reparse DAGs on demand, thanks to the addition of a DAG reparsing endpoint to the API.

Apache Airflow 2.10 screenshot showing the new reparse DAGs on demand button, now available with the new DAG reparsing API endpoint.

You also get more visibility in the 2.10 UI: a task's failed dependencies now appear on its details page, and the XCom display is much improved thanks to the view being rewritten as a proper React-based JSON view.

Screenshot of Apache Airflow 2.10 showing the Details tab of task failed dependencies.

Screenshot of Apache Airflow 2.10 showing the improved XCom display.

Lineage Enhancements

Data lineage can help with everything from understanding your data sources, to troubleshooting job failures, to managing PII, to ensuring compliance with data regulations. OpenLineage, the industry standard framework for data lineage, has a robust Airflow integration that allows you to have more insight into the operation and structure of the complex data ecosystems that Airflow orchestrates.

The OpenLineage Airflow integration has been around and in use for a while. However, it previously only gathered lineage information from explicitly implemented operators. One large gap was the PythonOperator, which, despite being the most widely used Airflow operator, had no support for lineage.

Now, with AIP 62, instrumentation has been added to collect lineage information from important hooks, so that popular operators like the PythonOperator, as well as the TaskFlow API and the Object Storage API, can emit lineage information. This is a key step toward closing the lineage gaps in Airflow, and it will translate into real-world benefits for users.

Multiple Executor Configuration

Picking an executor is one of the important choices you must make when setting up your Airflow instance. Each executor (Celery and Kubernetes being the most common) has advantages and disadvantages, balancing factors like latency, isolation, and compute efficiency. In previous versions of Airflow, you could pick just one executor for your Airflow instance, potentially leading to tradeoffs for your workflows.

Now, Airflow supports configuring multiple executors concurrently, so you can have the best of both worlds. Once multiple executors are set up in your Airflow config, you can assign each task to whichever executor best fits its resource utilization, latency, and custom execution requirements.
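As a rough sketch of how this can look (assuming both the Celery and Kubernetes executors are installed, with Celery listed first so it acts as the default), you list the executors in the [core] executor setting and then pin individual tasks to a specific executor with the task-level executor argument:

# airflow.cfg (or AIRFLOW__CORE__EXECUTOR); the first executor listed is the default:
#   [core]
#   executor = CeleryExecutor,KubernetesExecutor

from airflow.decorators import task

@task
def light_transform():
    # No executor specified, so this task runs on the default (Celery here).
    ...

@task(executor="KubernetesExecutor")
def heavy_isolated_job():
    # Explicitly routed to the Kubernetes executor for per-task isolation.
    ...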

Note that if you are an Astronomer customer, Astro does not currently support configuring multiple executors for one Deployment. However, using worker queues with the Celery executor offers similar customization for task execution.

Other Noteworthy Features and Updates

There are lots more notable updates in 2.10 to be aware of, including:

  • Deferrable operators can now start execution directly from the triggerer without going to the worker. For certain operators, like sensors, this is more efficient and can save teams time and money.
  • As part of AIP 64, task instance history is now kept for all task instance tries, not only the most recent attempt. This information is available to users now, but, excitingly, it is also part of the development of DAG versioning, which will come in a future Airflow release.
  • Important executor logs are now sent to the task logs. If the executor fails to start a task, the relevant error messages will be accessible to the user in the task logs, making debugging much easier.

And even these updates barely scratch the surface. Regardless of how you use Airflow, there’s something for you in 2.10.

Get Started with Airflow 2.10

Airflow 2.10 has way more features and improvements than we can cover in a single blog post. To learn more, check out the full release notes, and join us for a webinar on August 22nd that will cover the new release in more detail.

To try Airflow 2.10 and see all the great features for yourself, get started with a Free 14-Day Trial of Astro. We offer same-day support for all new Airflow releases, so you don’t have to wait to take advantage of the latest and greatest features.
