Best Practices and Solutions for Multi-Tenant Airflow
In the ever-evolving landscape of data orchestration, Airflow has emerged
as the de facto standard. As organizations expand their data
initiatives—from ETL processes to machine learning and predictive
analytics—the use of Airflow has grown exponentially. Companies like Snap
and DeliveryHero demonstrate the diverse and powerful applications of
Airflow across multiple teams and complex infrastructures.
However, as teams scale their Airflow usage, the most successful ones
eventually segregate different teams into separate Airflow environments.
Despite advances in Airflow’s multi-tenancy features, it’s still best
practice to provide each team with its own environment. Here’s why:
Airflow Environments Aren’t Designed to be Shared
There’s a host of reasons for this, but it ultimately boils down to three
big ones: execution control vs. access control, the impact of noisy
neighbors, and user productivity.
Execution Control vs Access Control
Out of the box today, Airflow supports DAG-level RBAC for controlling
which users can see which DAGs and the actions they can take on them. This
is helpful for following the principle of least privilege, but ultimately
isn’t enough for true multi-tenancy. For one, users can write DAGs that
directly access the underlying database. From the AWS docs about
MWAA:
Apache Airflow is not multi-tenant. While there are access control
measures to limit some features to specific users, which Amazon MWAA
implements, DAG creators do have the ability to write DAGs that can change
Apache Airflow user privileges and interact with the underlying
metadatabase.
So even though DAG-level RBAC is helpful for controlling access, it
creates a false sense of security. Users can write DAGs that directly
access the underlying database, potentially altering user privileges.
Moreover, RBAC controls only what a user can see, not the underlying
resources like Connections and Variables.
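To make that concrete, here is a minimal sketch (ordinary Airflow, nothing Astro-specific; the DAG id is hypothetical) of a task that opens a session against the metadata database and reads every Connection and Variable in the environment, regardless of what UI-level RBAC would show that author:

```python
# Minimal sketch: any DAG author in a shared environment can do this.
# Imports come from Airflow's own ORM; the DAG id is hypothetical.
from airflow.decorators import dag, task
from airflow.models import Connection, Variable
from airflow.utils.session import create_session
import pendulum


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def metadata_db_access_demo():
    @task
    def read_everything():
        with create_session() as session:
            # Nothing stops this task from reading every Connection and
            # Variable in the environment, whatever the UI-level RBAC says.
            conns = session.query(Connection).all()
            variables = session.query(Variable).all()
            print(f"{len(conns)} connections, {len(variables)} variables visible")

    read_everything()


metadata_db_access_demo()
```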
While additional overhead can mitigate these issues through rigorous CI/CD
and platform engineering, this approach often leads to technical debt and
slower upgrade cycles. It also places a significant burden on platform
teams during downtimes or unexpected failures, as end users might lack
access to necessary debugging views.
Noisy Neighbors
When sharing an environment, the actions of one team may inadvertently
affect another. These “noisy neighbors” situations can result in key
datasets missing their associated SLAs. This could be the result of one
team hogging worker resources during peak loads – if both teams have their
most mission critical workloads scheduled for midnight UTC, they might get
throttled due to worker availability. This situation applies not only to
worker resources, but also to scheduler availability. Airflow has no
mechanism for “schedule priority,” meaning that the most mission-critical
pipelines can’t be prioritized from a scheduling perspective.
Additionally, many of the configurations required to fine-tune Airflow for
a specific use case exist at the environment level, not at the DAG
level. Two teams may need conflicting concurrency settings, executors, or
other environment-wide options, or simply have different stylistic
preferences (default retries, catchup settings, etc.), as sketched below.
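As a rough illustration (DAG ids and values are hypothetical), DAG-level knobs like max_active_tasks, catchup, and default retries can differ per team, while settings such as the executor class or core parallelism apply to the whole environment, so only one value can win:

```python
# Hypothetical DAGs from two teams sharing one environment. Per-DAG settings
# can differ, but the environment-level settings below are shared by every
# tenant, so only one value is possible:
#   AIRFLOW__CORE__EXECUTOR=CeleryExecutor
#   AIRFLOW__CORE__PARALLELISM=64
from airflow import DAG
from airflow.operators.empty import EmptyOperator
import pendulum

with DAG(
    dag_id="team_a_backfill_heavy",              # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,                                # team A relies on backfills
    max_active_tasks=32,                         # and wants wide fan-out
    default_args={"retries": 3},
) as team_a_dag:
    EmptyOperator(task_id="placeholder")

with DAG(
    dag_id="team_b_latency_sensitive",           # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,                               # team B never backfills
    max_active_tasks=4,                          # and keeps fan-out small
    default_args={"retries": 0},
) as team_b_dag:
    EmptyOperator(task_id="placeholder")
```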
Python package conflicts also pose challenges. Data science teams might
need specific ML libraries, while data engineers may require different
versions of underlying packages. Resolving these conflicts can be
cumbersome and hinder productivity: what if two teams want to use
different versions of pandas? Or different versions of provider packages?
When a conflict exists, DAG authors may have to bear the burden of
refactoring to find a solution (switching to something like the
KubernetesPodOperator) or manually creating workarounds.
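Here’s a minimal sketch of that workaround, assuming the cncf-kubernetes provider is installed (the import path varies slightly by provider version, and the image names, module paths, and DAG id are hypothetical): each team’s task runs in its own container image, so their package versions never have to agree.

```python
# Hypothetical workaround: isolate conflicting dependencies per task by running
# each one in its own container image via KubernetesPodOperator.
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import pendulum

with DAG(
    dag_id="isolated_dependencies_example",            # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Data science task pinned to an image with its own (newer) pandas
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        image="registry.example.com/ds-team:pandas-2.2",   # hypothetical image
        cmds=["python", "-m", "ds_team.train"],            # hypothetical module
    )

    # Data engineering task pinned to an image with an older pandas
    build_reports = KubernetesPodOperator(
        task_id="build_reports",
        name="build-reports",
        image="registry.example.com/de-team:pandas-1.5",   # hypothetical image
        cmds=["python", "-m", "de_team.reports"],          # hypothetical module
    )

    train_model >> build_reports
```

The trade-off is that DAG authors now maintain container images and lose some of the convenience of plain Python operators, which is exactly the refactoring burden described above.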
Upgrades
Lastly, coordinating upgrades is much harder when multiple teams have to be
in sync for them. Scheduling any sort of downtime or maintenance window
can cause disruption across multiple teams, making it harder to adopt
new features as they're released in Airflow. While upgrades may only happen
quarterly, the same problem applies to coordinating different
repos. Teams are either left maintaining several different CI/CD scripts,
or left to fit all their DAGs into one repo, both of which carry
undifferentiated maintenance costs.
In the community, many teams have developed their own workarounds for these
limitations. While this is possible, it means taking control and features
away from end users and bearing the heavy cost of maintaining custom
solutions.
Running Multiple Airflows is a Full-Time Job
Airflow is a fast-moving open source project, and there’s a lot coming in
Airflow 3 that will address some of the pain points listed above. Today,
however, we’ve found that teams opt to run a monolithic environment
not because it’s optimal, but because of constrained DevOps resources.
Maintaining even one environment demands significant effort. Scaling this
to provide multiple teams with dev, stage, and prod environments is a
substantial undertaking. Infrastructure costs also add up, despite
Airflow’s relatively light footprint, due to the compute requirements of
long-running services like the scheduler and web server.
Allowing teams to spin up multiple Airflow environments can create
observability issues. “Airflow sprawl” can result in key datasets being
managed by outdated or insecure setups. Centralized logging, monitoring,
and governance become essential, especially when one team’s DAGs serve as
inputs for another.
Astronomer Helps
When teams look to run multi-tenant Airflow, they want to balance letting
each team move at its own pace and meeting users where they are, without
losing the isolation and observability that mission-critical data workloads
require. That’s why Astro is designed to give everyone
an easy path to production for their data pipelines.
Isolated Environments with Complete Observability
The Astro control plane provides a single point of control and governance
for creating Airflow environments. These environments can all run in
separate clusters and cloud providers, integrating with your identity
provider (IdP) for seamless permissions management. The control plane serves
as a single pane of glass: users see only the DAGs they have access to,
regardless of where they run.
Admins benefit from dashboards displaying:
- Location, version, and underlying productivity metrics of each environment (tasks run, number of code deploys, etc.)
- SLAs defined and missed for the underlying DAGs
- Breakdown of costs associated with each environment
- Operator usage over time
This approach ensures multi-tenancy for both execution and permission
controls, without sacrificing unified observability.
Ephemeral, Hibernating, and Autoscaling Environments
One of the reasons observability is so important is that Airflow
environments can be very different from one another. Some use cases have
different compute needs, while others are only experimental and
don’t need a permanent underlying environment. For these short-lived
development environments, “branch-based deploys” bring a web
development-like experience to data engineering: these ephemeral
development environments’ lifecycles are matched to the branch they’re
connected to. Additionally, not only do Airflow workers scale to zero when
they’re not being used, but environments can also be scheduled to scale
down all Airflow components to zero - what we call “Hibernation.” Each
environment can follow its own hibernation schedule, making the best use
of your infrastructure. Hibernation can be used in conjunction with the
ephemeral environments to ensure that infrastructure is only running when
it has to.
The automation required to standardize this within an enterprise can be
done directly via the astro-cli, through a fully functional REST API, or
through the official Astronomer Terraform
provider.
dbt Deploys
Giving everyone in an organization a consistent path to production for
data pipelines often involves more than just Airflow. dbt is the standard
tool for the transformation layer of many ETL workloads and is often used
in conjunction with Airflow.
Traditionally, using Airflow to orchestrate dbt jobs required two separate
processes: changes to dbt models were pushed to one repo, and then the
underlying Airflow DAGs that ran them had to be deployed separately. dbt
deploys decrease the overhead of maintaining two separate CI/CD processes
by letting dbt changes be deployed directly to Airflow without redeploying
Airflow DAGs.
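For context, here’s a minimal sketch of the traditional pattern described above (the DAG id and project path are hypothetical): the dbt project ships alongside the DAGs, and a task simply shells out to dbt, so every model change means redeploying the whole Airflow project.

```python
# Hypothetical "traditional" setup: dbt project bundled with the Airflow code,
# invoked from a BashOperator. Changing a dbt model requires redeploying Airflow.
from airflow import DAG
from airflow.operators.bash import BashOperator
import pendulum

with DAG(
    dag_id="dbt_daily_run",                   # hypothetical
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        # Hypothetical path where the dbt project is bundled with the deploy
        bash_command="dbt run --project-dir /usr/local/airflow/dbt/my_project",
    )
```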
Rollbacks for Easy Upgrades and Safety
Last but not least, despite the decreased risk that environment isolation
affords, we know that unexpected situations do occur. Astro provides the
ability to roll back an environment to a previously healthy state. This
not only adds to the
safety of trying new things (a new version of Airflow, an updated provider
package, etc.), but also allows for easier resolution of outages.
Ready to optimize your Airflow environments and enhance your team’s
productivity with a multi-tenant approach? Try
Astro
for free and experience seamless multi-tenancy, advanced observability,
and efficient resource management.
…But wait, there’s more?
If you’re already running Airflow for multiple teams, or you’re
considering moving in this direction but are concerned about tracking
delivery of your key datasets and cross-team dependencies, Astro Observe
is designed for you. We’ve written a lot about this, but we’d love to chat
with you and get your feedback!