Maximizing Data Workflow Efficiency: The Advantages of Using Airflow with Azure Data Factory
In the world of data engineering, combining the strengths of different tools is a must-have, not a nice-to-have. Azure Data Factory (ADF) is the go-to, user-friendly tool that plugs into the Azure ecosystem. When paired with Apache Airflow®, the leading open-source workflow management tool, the duo sets the foundation for data orchestration that can work across the gamut of an enterprise.
This blog post explores how Airflow and ADF can be used side by side through a real-world example: a retail company that needs to regularly extract, transform, and load customer data from various sources into its database for analysis and reporting.
Why integrate Airflow with ADF?
ADF excels at creating quick, low-code data jobs with an intuitive UX. By layering Airflow's orchestration capabilities on top of ADF workflows, developers get end-to-end visibility of their workflows without needing to migrate any jobs. Airflow's pythonic expressiveness allows developers to create more complex, conditional workflows that can evolve over time. Additionally, you can start managing ADF pipelines through Airflow without altering or replacing existing ADF pipelines, making it purely additive to ADF's existing capabilities.
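For instance, a minimal DAG can wrap an existing ADF pipeline without touching it. The sketch below uses placeholder identifiers (connection ID, resource group, factory, and pipeline names); only the trigger and monitoring move to Airflow, not the pipeline itself.

from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    "trigger_existing_adf_pipeline",
    start_date=datetime(2023, 10, 31),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The ADF pipeline already exists and is not modified; Airflow only
    # triggers it and waits for the run to finish so its status shows up
    # in the Airflow UI.
    run_existing_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_existing_pipeline",
        azure_data_factory_conn_id="azure_conn",   # placeholder connection ID
        resource_group_name="my-resource-group",   # placeholder resource group
        factory_name="my-data-factory",            # placeholder factory name
        pipeline_name="ExistingAdfPipeline",       # placeholder ADF pipeline name
        wait_for_termination=True,                 # poll ADF until the run completes
    )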
Another great benefit that Airflow provides when used alongside ADF is its ability to connect with a wide array of services outside the Azure environment and seamlessly link them to ADF workflows. With Airflow, you can integrate operations before or after ADF workflows are triggered, as well as transferring information between ADF and external platforms. This expanded integration enables data pipelines to extend beyond Azure services and sets the stage for hybrid and multi-cloud workloads.
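As an illustrative sketch (the DAG, task, and pipeline names here are placeholders, not part of the retail example below), an Airflow DAG might land files from a non-Azure source before an ADF run and notify an external system afterwards:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    "adf_with_external_services",
    start_date=datetime(2023, 10, 31),
    schedule_interval=None,
    catchup=False,
) as dag:

    @task
    def land_files_from_external_source():
        # Placeholder: pull data from a non-Azure system (SaaS API, SFTP, S3, ...)
        # and land it where the ADF pipeline expects its input.
        ...

    @task
    def notify_external_consumers():
        # Placeholder: publish a message to Slack, Kafka, or another external
        # system once the ADF run has finished.
        ...

    run_adf = AzureDataFactoryRunPipelineOperator(
        task_id="run_adf_pipeline",
        azure_data_factory_conn_id="azure_conn",   # placeholder connection ID
        resource_group_name="my-resource-group",   # placeholder resource group
        factory_name="my-data-factory",            # placeholder factory name
        pipeline_name="MyAdfPipeline",             # placeholder ADF pipeline name
    )

    land_files_from_external_source() >> run_adf >> notify_external_consumers()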
Utilizing Airflow as a control plane for ADF jobs brings together the best of both worlds, ultimately enabling users to monitor their ADF jobs through the Airflow UI alongside any other pipelines they’re running. This synergy ensures efficiency and cohesion in workflow management, providing users with a single pane of glass for both monitoring and alerting.
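One hedged example of what that single pane of glass can look like: attaching an on_failure_callback to the ADF task routes failures of the ADF run into whatever alerting Airflow already uses. The callback body and identifiers below are placeholders.

from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

def alert_on_failure(context):
    # Placeholder callback: forward the failed task's details to the alerting
    # channel you already use with Airflow (email, Slack, PagerDuty, ...).
    ti = context["task_instance"]
    print(f"ADF task {ti.task_id} in DAG {ti.dag_id} failed for run {context['run_id']}")

# Inside a `with DAG(...)` block:
run_adf = AzureDataFactoryRunPipelineOperator(
    task_id="run_adf_pipeline",
    azure_data_factory_conn_id="azure_conn",   # placeholder connection ID
    resource_group_name="my-resource-group",   # placeholder resource group
    factory_name="my-data-factory",            # placeholder factory name
    pipeline_name="MyAdfPipeline",             # placeholder ADF pipeline name
    on_failure_callback=alert_on_failure,      # ADF failures surface through Airflow alerting
)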
ADF and Airflow in Practice
The following DAG demonstrates a real-world ETL and analysis process in which ADF conducts the transfer operations between Azure services before Airflow triggers an Azure Synapse job that consumes that data for an analysis script. The DAG includes multiple instances of the AzureDataFactoryRunPipelineOperator as well as a GreatExpectationsOperator, showcasing how Airflow can implement a cross-platform ETL and data quality check workflow from a single pane of glass. These operators demonstrate Airflow's capability to trigger and manage multiple ADF pipelines (CustomerExtraction, CleanRawCustomerData, and LoadCleanData) alongside other applications like Great Expectations in one seamless experience. This flexibility to run multiple ADF pipelines in parallel within the same DAG illustrates how Airflow can enhance ADF's data processing capabilities.
from datetime import datetime, timedelta

from airflow import DAG, Dataset
from airflow.operators.dummy import DummyOperator
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

RESOURCE_GROUP = 'DemoGroup'
FACTORY_NAME = 'DemoDFYates'

# Dataset used for data-driven scheduling: other DAGs can be scheduled on
# updates to this dataset.
example_dataset = Dataset("AzureSet")

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
}
with DAG(
    'azure_services_dag',
    default_args=default_args,
    description='An Airflow DAG that uses Azure services',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2023, 10, 31),
    catchup=False,
) as dag:

    start = DummyOperator(task_id="start")

    # Dynamically mapped task: one ADF pipeline run per extraction pipeline name.
    ingest_data = AzureDataFactoryRunPipelineOperator.partial(
        task_id='ingest_data',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
    ).expand(pipeline_name=['CustomerExtraction', 'CustomerExtraction1', 'CustomerExtraction2'])

    # Trigger the ADF pipeline that cleans the raw customer data.
    transform_data = AzureDataFactoryRunPipelineOperator(
        task_id='transform_data',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
        pipeline_name='CleanRawCustomerData',
    )

    # Trigger the ADF pipeline that loads the cleaned data; declaring the
    # dataset as an outlet lets other DAGs be scheduled off its completion.
    store_data = AzureDataFactoryRunPipelineOperator(
        task_id='load_into_mssql',
        azure_data_factory_conn_id='azure_conn',
        resource_group_name=RESOURCE_GROUP,
        factory_name=FACTORY_NAME,
        pipeline_name='LoadCleanData',
        outlets=[example_dataset],
    )

    # Data quality check with Great Expectations against the loaded data.
    GXAnalysis = GreatExpectationsOperator(
        task_id="gx_analysis",  # added: every Airflow task requires a task_id
        conn_id="greatexpectations",
        schema="Sample",
        data_asset_name="CleanCustomerData",  # assumed name of the table to validate
        expectation_suite_name="ConformstoNorm",
        execution_engine="default",
        run_name="testrun",
    )

    # Define dependency chain
    start >> ingest_data >> transform_data >> store_data >> GXAnalysis
The use of the .partial and .expand methods to dynamically map the ingest_data task showcases Airflow's ability to add scalability and customization to existing ADF pipelines. With these methods, the DAG can change at runtime to run a varying set of pipelines, in this case different customer extraction jobs. This is particularly beneficial when similar tasks need to be repeated with slight variations that may change each time the pipeline runs, demonstrating how Airflow can scale ADF's capabilities to handle more complex, iterative workflows.
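To make the mapped set fully dynamic, the list of pipeline names can itself come from an upstream task rather than a hard-coded list. A minimal sketch, reusing the imports and constants from the DAG above and assuming a hypothetical helper task that decides which extraction pipelines to run (here it simply returns the same three names):

from airflow.decorators import task

@task
def list_extraction_pipelines() -> list[str]:
    # Placeholder: in practice this could read an Airflow Variable, query a
    # config table, or call the ADF API to decide which pipelines to run today.
    return ["CustomerExtraction", "CustomerExtraction1", "CustomerExtraction2"]

ingest_data = AzureDataFactoryRunPipelineOperator.partial(
    task_id="ingest_data",
    azure_data_factory_conn_id="azure_conn",
    resource_group_name=RESOURCE_GROUP,
    factory_name=FACTORY_NAME,
).expand(pipeline_name=list_extraction_pipelines())  # one mapped task per returned name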
Another example of Airflow's synergistic benefits comes in the store_data task, where the output of an ADF job is used to trigger another Airflow pipeline through data-driven scheduling. This integration showcases how Airflow can add dynamic, data-driven scheduling capabilities to ADF pipelines without needing to alter the underlying ADF transformations. By specifying outlets, Airflow helps manage the flow of data, ensuring that the output of one task is correctly directed and utilized in subsequent tasks.
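Concretely, any other DAG can subscribe to the AzureSet dataset that store_data declares as an outlet. The downstream DAG and task names below are placeholders:

from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.dummy import DummyOperator

# The same dataset URI that store_data declares as an outlet.
example_dataset = Dataset("AzureSet")

with DAG(
    "customer_reporting",           # placeholder downstream DAG name
    schedule=[example_dataset],     # runs whenever store_data updates the dataset
    start_date=datetime(2023, 10, 31),
    catchup=False,
) as reporting_dag:
    build_reports = DummyOperator(task_id="build_reports")  # placeholder reporting task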
Airflow further enhances this workflow by allowing the user to have a Great Expectations data quality check triggered by the completion of an ADF workflow, creating one seamless end-to-end pipeline instead of a fragmented workflow across multiple systems.
The integration of Airflow with Azure Data Factory in our retail company scenario demonstrated significant efficiency gains without disrupting existing data processes. Importantly, no ADF jobs were moved or altered in this integration, ensuring that there was no drop in consumption or interruption in ongoing operations. This approach not only preserved the integrity of the existing workflow but also enhanced it, leading to a more streamlined and effective data management process. The business outcome was significant: the company could more efficiently manage its customer data, leading to improved data analysis and reporting capabilities. This, in turn, provided valuable insights for better decision-making, directly contributing to enhanced customer satisfaction and business growth.
In summary, the synergy of Airflow and ADF provides a comprehensive toolkit for modern data engineers and businesses. It epitomizes the concept of using the right tool for the right job, combining ease of use with powerful functionality, thus enabling the creation of more efficient, scalable, and flexible data pipelines.
To experience the seamless integration of Airflow with ADF firsthand, try Astro on Azure today!