Best Practices and Solutions for Multi-Tenant Airflow
In the ever-evolving landscape of data orchestration, Airflow has emerged
as the de facto standard. As organizations expand their data
initiatives—from ETL processes to machine learning and predictive
analytics—the use of Airflow has grown exponentially. Companies like Snap
and DeliveryHero demonstrate the diverse and powerful applications of
Airflow across multiple teams and complex infrastructures.
However, as teams scale their Airflow usage, the most successful ones
eventually segregate different teams into separate Airflow environments.
Despite advances in Airflow’s multi-tenancy features, it’s still best
practice to provide each team with its own environment. Here’s why:
Airflow Environments Aren’t Designed to be Shared
There’s a host of reasons for this, but it ultimately boils down to three
big ones: execution control vs. access control, the impact of noisy
neighbors, and user productivity.
Execution Control vs Access Control
Out of the box today, Airflow supports DAG-level RBAC for controlling
which users can see which DAGs and the actions they can take on them. This
is helpful for following the principle of least privilege, but ultimately
isn’t enough for true multi-tenancy. For one, users can write DAGs that
directly access the underlying database. From the AWS docs about
MWAA:
Apache Airflow is not multi-tenant. While there are access control
measures to limit some features to specific users, which Amazon MWAA
implements, DAG creators do have the ability to write DAGs that can change
Apache Airflow user privileges and interact with the underlying
metadatabase.
So even though DAG-level RBAC is helpful for controlling access, it
creates a false sense of security. Users can write DAGs that directly
access the underlying database, potentially altering user privileges.
Moreover, RBAC controls only what a user can see, not the underlying
resources like Connections and Variables.
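To make that concrete, here is a minimal sketch (ordinary Airflow, nothing Astro-specific; the DAG id is hypothetical) of a task that opens a session against the metadata database and reads every Connection and Variable in the environment, regardless of what UI-level RBAC would show that author:

```python
# Minimal sketch: any DAG author in a shared environment can do this.
# Imports come from Airflow's own ORM; the DAG id is hypothetical.
from airflow.decorators import dag, task
from airflow.models import Connection, Variable
from airflow.utils.session import create_session
import pendulum


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def metadata_db_access_demo():
    @task
    def read_everything():
        with create_session() as session:
            # Nothing stops this task from reading every Connection and
            # Variable in the environment, whatever the UI-level RBAC says.
            conns = session.query(Connection).all()
            variables = session.query(Variable).all()
            print(f"{len(conns)} connections, {len(variables)} variables visible")

    read_everything()


metadata_db_access_demo()
```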
While additional overhead can mitigate these issues through rigorous CI/CD
and platform engineering, this approach often leads to technical debt and
slower upgrade cycles. It also places a significant burden on platform
teams during downtimes or unexpected failures, as end users might lack
access to necessary debugging views.
Noisy Neighbors
When sharing an environment, the actions of one team may inadvertently
affect another. These “noisy neighbors” situations can result in key
datasets missing their associated SLAs. This could be the result of one
team hogging worker resources during peak loads – if both teams have their
most mission critical workloads scheduled for midnight UTC, they might get
throttled due to worker availability. This situation applies not only to
worker resources, but also to scheduler availability. Airflow has no
mechanism for “schedule priority,” meaning that the most mission-critical
pipelines can’t be prioritized from a scheduling perspective.
Additionally, many of the configurations required to fine-tune Airflow for
a specific use case exist at the environment level, not at the DAG
level. Two teams may need conflicting concurrency settings, executors, or
other environment-wide options, or simply have different stylistic
preferences (default retries, catchup settings, etc.), as sketched below.
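As a rough illustration (DAG ids and values are hypothetical), DAG-level knobs like max_active_tasks, catchup, and default retries can differ per team, while settings such as the executor class or core parallelism apply to the whole environment, so only one value can win:

```python
# Hypothetical DAGs from two teams sharing one environment. Per-DAG settings
# can differ, but the environment-level settings below are shared by every
# tenant, so only one value is possible:
#   AIRFLOW__CORE__EXECUTOR=CeleryExecutor
#   AIRFLOW__CORE__PARALLELISM=64
from airflow import DAG
from airflow.operators.empty import EmptyOperator
import pendulum

with DAG(
    dag_id="team_a_backfill_heavy",              # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                                # team A relies on backfills
    max_active_tasks=32,                         # and wants wide fan-out
    default_args={"retries": 3},
) as team_a_dag:
    EmptyOperator(task_id="placeholder")

with DAG(
    dag_id="team_b_latency_sensitive",           # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,                               # team B never backfills
    max_active_tasks=4,                          # and keeps fan-out small
    default_args={"retries": 0},
) as team_b_dag:
    EmptyOperator(task_id="placeholder")
```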
Python package conflicts also pose challenges. Data science teams might
need specific ML libraries, while data engineers may require different
versions of underlying packages. Resolving these conflicts can be
cumbersome and hinder productivity: what if two teams want to use
different versions of pandas? Or different versions of provider packages?
When a conflict exists, DAG authors may have to bear the burden of
refactoring to find a solution (switching to something like the
KubernetesPodOperator) or manually creating workarounds.
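Here’s a minimal sketch of that workaround, assuming the cncf-kubernetes provider is installed (the import path varies slightly by provider version, and the image names, module paths, and DAG id are hypothetical): each team’s task runs in its own container image, so their package versions never have to agree.

```python
# Hypothetical workaround: isolate conflicting dependencies per task by running
# each one in its own container image via KubernetesPodOperator.
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import pendulum

with DAG(
    dag_id="isolated_dependencies_example",            # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Data science task pinned to an image with its own (newer) pandas
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        image="registry.example.com/ds-team:pandas-2.2",   # hypothetical image
        cmds=["python", "-m", "ds_team.train"],            # hypothetical module
    )

    # Data engineering task pinned to an image with an older pandas
    build_reports = KubernetesPodOperator(
        task_id="build_reports",
        name="build-reports",
        image="registry.example.com/de-team:pandas-1.5",   # hypothetical image
        cmds=["python", "-m", "de_team.reports"],          # hypothetical module
    )

    train_model >> build_reports
```

The trade-off is that DAG authors now maintain container images and lose some of the convenience of plain Python operators, which is exactly the refactoring burden described above.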
Upgrades
Lastly, coordinating upgrades is much harder when multiple teams have to be
in sync for them. Scheduling any sort of downtime or maintenance window
can cause disruption across multiple teams, making it harder to adopt
new features as they're released in Airflow. While upgrades may only happen
quarterly, the same problem applies to coordinating different
repos. Teams are either left maintaining several different CI/CD scripts,
or left to fit all their DAGs into one repo, both of which carry
undifferentiated maintenance costs.
In the community, many teams have developed their own workarounds for these
limitations. While this is possible, it means taking control and features
away from end users and bearing the heavy cost of maintaining custom
solutions.
Running Multiple Airflows is a Full-Time Job
Airflow is a fast-moving open source project, and there’s a lot coming in
Airflow 3 that will address some of the pain points listed above. Today,
however, we’ve found that teams opt to run a monolithic environment
not because it’s optimal, but because of constrained DevOps resources.
Maintaining even one environment demands significant effort. Scaling this
to provide multiple teams with dev, stage, and prod environments is a
substantial undertaking. Infrastructure costs also add up, despite
Airflow’s relatively light footprint, due to the compute requirements of
long-running services like the scheduler and web server.
Allowing teams to spin up multiple Airflow environments can create
observability issues. “Airflow sprawl” can result in key datasets being
managed by outdated or insecure setups. Centralized logging, monitoring,
and governance become essential, especially when one team’s DAGs serve as
inputs for another.
Astronomer Helps
When teams look to run multi-tenant Airflow, they want to balance letting
each team move at its own pace and meeting users where they are, without
losing the isolation and observability that mission-critical data workloads
require. That’s why Astro is designed to give everyone
an easy path to production for their data pipelines.
Isolated Environments with Complete Observability
The Astro control plane provides a single point of control and governance
for creating Airflow environments. These environments can all run in
separate clusters and cloud providers, integrating with your identity
provider (IdP) for seamless permissions management. The control plane serves
as a single pane of glass: users see only the DAGs they have access to,
regardless of where they run.
Admins benefit from dashboards displaying:
- Location, version, and underlying productivity metrics of each environment (tasks run, number of code deploys, etc.)
- SLAs defined and missed for the underlying DAGs
- Breakdown of costs associated with each environment
- Operator usage over time
This approach ensures multi-tenancy for both execution and permission
controls, without sacrificing unified observability.
Ephemeral, Hibernating, and Autoscaling Environments
One of the reasons observability is so important is that Airflow
environments can be very different from one another. Some use cases have
different compute needs, while others are only experimental and
don’t need a permanent underlying environment. For these short-lived
development environments, “branch-based deploys” bring a web
development-like experience to data engineering: these ephemeral
development environments’ lifecycles are matched to the branch they’re
connected to. Additionally, not only do Airflow workers scale to zero when
they’re not being used, but environments can also be scheduled to scale
down all Airflow components to zero - what we call “Hibernation.” Each
environment can follow its own hibernation schedule, making the best use
of your infrastructure. Hibernation can be used in conjunction with the
ephemeral environments to ensure that infrastructure is only running when
it has to.
The automation required to standardize this within an enterprise can be
done directly via the astro-cli, through a fully functional REST API, or
through the official Astronomer Terraform
provider.
dbt Deploys
Giving everyone in an organization a consistent path to production for
data pipelines often involves more than just Airflow. dbt is the standard
tool for the transformation layer of many ETL workloads and is often used
in conjunction with Airflow.
Traditionally, using Airflow to orchestrate dbt jobs required two separate
processes: changes to dbt models were pushed to one repo, and then the
underlying Airflow DAGs that ran them had to be deployed separately. dbt
deploys decrease the overhead of maintaining two separate CI/CD processes
by letting dbt changes be deployed directly to Airflow without redeploying
Airflow DAGs.
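For context, here’s a minimal sketch of the traditional pattern described above (the DAG id and project path are hypothetical): the dbt project ships alongside the DAGs, and a task simply shells out to dbt, so every model change means redeploying the whole Airflow project.

```python
# Hypothetical "traditional" setup: dbt project bundled with the Airflow code,
# invoked from a BashOperator. Changing a dbt model requires redeploying Airflow.
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

with DAG(
    dag_id="dbt_daily_run",                   # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        # Hypothetical path where the dbt project is bundled with the deploy
        bash_command="dbt run --project-dir /usr/local/airflow/dbt/my_project",
    )
```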
Rollbacks for Easy Upgrades and Safety
Last but not least, despite the decreased risk that environment isolation
affords, we know that unexpected situations do occur. Astro provides the
ability to roll back an environment to a previously healthy state. This
not only adds to the
safety of trying new things (a new version of Airflow, an updated provider
package, etc.), but also allows for easier resolution of outages.
Ready to optimize your Airflow environments and enhance your team’s
productivity with a multi-tenant approach? Try
Astro
for free and experience seamless multi-tenancy, advanced observability,
and efficient resource management.
…But wait, there’s more?
If you’re already running Airflow for multiple teams, or you’re
considering moving in this direction but are concerned about tracking
delivery of your key datasets and cross-team dependencies, Astro Observe
is designed for you. We’ve written a lot about this, but we’d love to chat
with you and get your feedback!