Modernizing legacy ETL workloads with Airflow

  • George Yates
  • Jake Roach

As data infrastructure needs grow, modernizing legacy ETL workloads has become a priority for many organizations. Legacy systems like Informatica have long provided reliable workflows, but modern orchestrators like Apache Airflow now offer enhanced flexibility, scalability, and observability that legacy systems simply can’t match. Transitioning from legacy tools like Informatica to Airflow, however, can be challenging: it requires restructuring GUI-designed ETL jobs into a code-based environment, and for organizations with hundreds of workflows, that can seem daunting. Fortunately, tools like DAG Factory can streamline the process by dynamically creating Airflow DAGs from configuration files, automating as much as 90% of the work.

Intro to DAG Factory

DAG Factory is an open-source library supported by Astronomer that defines Airflow DAGs via YAML files. By leveraging this tool, data teams can migrate workloads from legacy systems without needing to learn how to write Airflow DAGs in Python, eliminating the biggest barrier to conversion. In this guide, we’ll outline the key steps to converting legacy workloads like Informatica workflows into Airflow DAGs using DAG Factory and explore how this tool enables bulk DAG creation without extensive code.

The Benefits of Migrating Legacy Workloads

There are multiple advantages to transitioning legacy workloads to Airflow. For one, Airflow’s flexibility as a Python-based orchestrator allows for far more complex custom logic than Informatica. It also provides a much wider array of integrations, connecting to almost any legacy or modern tool, which makes it well suited to managing both old and new parts of your data stack as your organization gradually adopts more modern solutions. Additionally, Airflow is built for scalability, enabling workflows to grow and adapt as data requirements evolve. Beyond flexibility and scalability, the monitoring capabilities in Airflow’s user interface and logging features mean that data engineers can observe workflows in real time, respond quickly to failures, and establish advanced alerting.

Once you migrate your Informatica workflows to Airflow, you unlock the ability to orchestrate them on Astronomer, a managed Airflow platform that provides unparalleled scalability and reliability for modern data pipelines. Unlike legacy tools, which may struggle with scaling under growing data demands, Astronomer’s cloud-native infrastructure dynamically allocates resources to meet the needs of your workflows, ensuring they run efficiently even as complexity increases. With Astronomer, you gain access to features like auto-scaling worker nodes, enhanced monitoring, and seamless integration with CI/CD pipelines, enabling you to manage thousands of DAGs effortlessly. This shift not only modernizes your data ecosystem but also positions your team to handle future data workloads with confidence and efficiency, ensuring your pipelines are always ready to scale alongside your business.

Mapping Legacy Workflows to Airflow Configurations

To effectively replicate Informatica workflows in Airflow, it’s helpful to understand how Informatica concepts map to Airflow configurations. For instance, an Informatica “session” (the core unit of work) typically maps to an Airflow task. Similarly, an entire Informatica workflow corresponds to an Airflow DAG, with additional fields for specifying schedules and dependencies. Each task in Airflow requires an operator (e.g., BashOperator, PythonOperator) that corresponds to the session’s specific ETL action, and dependencies between tasks are mapped as well.

To convert an Informatica workflow into an Airflow DAG, the workflow’s structure, schedule, tasks, and dependencies must all be mapped. In Informatica, workflows are typically defined through a graphical interface where tasks, dependencies, and schedules are visually arranged and parameterized. However, Informatica lets you export workflows and their mappings as XML files. These high-level configurations can be translated into YAML-based configurations that DAG Factory then reads to create DAGs in Airflow. The process requires a structured approach to ensure that the key elements of the legacy workflows are retained.

An example conversion of an Informatica workflow that loads data, transforms it, and sends a report email could look like this in YAML:

example_dag:
  default_args:
    owner: "airflow"
    email: "alert@example.com"
    retries: 2
    start_date: 2024-01-01
  schedule_interval: "0 2 * * *"
  tasks:
    extract_data:
      operator: airflow.operators.bash.BashOperator
      bash_command: "python extract_data.py"

    transform_data:
      operator: airflow.operators.python.PythonOperator
      python_callable_name: "transform_data"
      python_callable_file: "scripts/transform_data.py"
      dependencies: [extract_data]

    load_data:
      operator: airflow.operators.bash.BashOperator
      bash_command: "python load_data.py"
      dependencies: [transform_data]

    send_report:
      operator: airflow.operators.email.EmailOperator
      to: "report@example.com"
      subject: "Daily Report"
      html_content: "The daily report pipeline completed successfully."
      dependencies: [load_data]

In this YAML configuration, each key represents a component of the DAG. The top-level example_dag block specifies the schedule and default arguments such as the owner, retry settings, and start date. Each task (extract_data, transform_data, load_data, and send_report) is mapped to an Airflow operator along with the command or callable it runs, and dependencies between tasks are specified to mirror the sequence in the original Informatica workflow.
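For comparison, here is a rough sketch of the hand-written Airflow DAG that this single YAML file replaces. The script paths and the transform callable are illustrative placeholders carried over from the configuration above, not code from an actual migration:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator

# Placeholder import for the transform step referenced in the YAML above
from scripts.transform_data import transform_data

with DAG(
    dag_id="example_dag",
    default_args={"owner": "airflow", "email": "alert@example.com", "retries": 2},
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",
) as dag:
    extract = BashOperator(task_id="extract_data", bash_command="python extract_data.py")
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    load = BashOperator(task_id="load_data", bash_command="python load_data.py")
    report = EmailOperator(
        task_id="send_report",
        to="report@example.com",
        subject="Daily Report",
        html_content="The daily report pipeline completed successfully.",
    )

    # Mirror the dependency chain defined in the YAML configuration
    extract >> transform >> load >> report

Multiply this boilerplate by hundreds of workflows and the appeal of a configuration-driven approach becomes clear.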

Generating DAGs Dynamically with DAG Factory

With YAML configurations in place, we can dynamically generate Airflow DAGs using a short Python script. First, create a folder named configs/ inside the dags/ directory and place all of your YAML configuration files there. Then add the following script to the dags/ directory; it reads each YAML configuration file and generates a corresponding DAG in Airflow:

import os

import dagfactory

# Path to the configs/ folder that lives alongside this script inside dags/
config_path = os.path.join(os.path.dirname(__file__), "configs")

# Read each YAML configuration and register the resulting DAG with Airflow
for config_file in os.listdir(config_path):
    if config_file.endswith(".yaml"):
        dag_factory = dagfactory.DagFactory(os.path.join(config_path, config_file))
        dag_factory.generate_dags(globals())
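
With this loader in place (saved here under the hypothetical name generate_dags.py), the project layout looks like:

dags/
├── generate_dags.py
└── configs/
    ├── example_dag.yaml
    └── another_workflow.yaml

Each time a new YAML file lands in configs/, the loader picks it up on the next DAG parse, so onboarding another converted workflow requires no new Python.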

Scaling Configurations for Bulk Creation

To streamline large-scale migrations, reusable and templated configurations are essential. For similar workflows, templates can reduce redundancy, while parameterized variables (like dataset_name or execution_date) can be used for workflows that vary only slightly. If there are hundreds of workflows to convert, automation scripts can generate YAML configurations programmatically, using metadata exported from Informatica. This approach enables rapid scaling, allowing organizations to transition a substantial number of workflows in a short time frame.
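To make this concrete, below is a minimal sketch of such a generator. It assumes an Informatica PowerCenter-style XML export with WORKFLOW, SESSION, and WORKFLOWLINK elements carrying NAME, FROMTASK, and TOTASK attributes; real exports vary by version and configuration, so treat the element names, the default BashOperator stubs, and the output layout as assumptions to adapt rather than a finished converter:

# Hypothetical sketch: generate DAG Factory YAML from an Informatica XML export.
# Element and attribute names (WORKFLOW, SESSION, WORKFLOWLINK, NAME, FROMTASK,
# TOTASK) are assumptions about the export schema; adjust them to your exports.
import xml.etree.ElementTree as ET

import yaml  # PyYAML


def workflow_xml_to_yaml(xml_path: str, yaml_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    workflow = root.find(".//WORKFLOW")
    dag_id = workflow.get("NAME", "converted_workflow").lower()

    tasks = {}
    for session in workflow.findall(".//SESSION"):
        task_id = session.get("NAME").lower()
        # Default every session to a BashOperator stub; swap in the operator
        # that matches the session's actual ETL action during review.
        tasks[task_id] = {
            "operator": "airflow.operators.bash.BashOperator",
            "bash_command": f"echo 'TODO: implement {task_id}'",
        }

    # Dependencies are assumed to be modeled as workflow links between tasks.
    for link in workflow.findall(".//WORKFLOWLINK"):
        upstream = link.get("FROMTASK", "").lower()
        downstream = link.get("TOTASK", "").lower()
        if upstream in tasks and downstream in tasks:
            tasks[downstream].setdefault("dependencies", []).append(upstream)

    config = {
        dag_id: {
            "default_args": {"owner": "airflow", "retries": 2, "start_date": "2024-01-01"},
            "schedule_interval": None,  # fill in from the workflow's scheduler settings
            "tasks": tasks,
        }
    }

    with open(yaml_path, "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)

Run once per exported workflow, a script along these lines can bulk-produce the configs/ directory that the loader above turns into live DAGs.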

To see how this is done in practice, check out the following GitHub repo, where Jake Roach provides a step-by-step guide to setting up a templated configuration for the bulk creation of dozens or even hundreds of Airflow DAGs with DAG Factory:

https://github.com/astronomer/building-dags-with-dag-factory

The best way to get started with Airflow is to build and run your ETL workloads and data pipelines on Astro, our fully managed service. In addition to DAG Factory, Astro also offers:

  • Seamless Local Development: Enhanced integration with the Astro CLI allows developers to test DAGs locally with the same connections used in production.
  • Extensive Library of Pre-Built Integrations: The Astronomer Registry offers 1,600+ pre-built integrations with popular databases, APIs, and cloud storage platforms, ensuring compatibility with your entire data ecosystem.
  • Data Observability with Astro Observe: Enhanced data lineage, cross-deployment dependency graphs, and SLA tracking for your Airflow pipelines.

Conclusion

Migrating from Informatica to Airflow using DAG Factory offers significant benefits, from flexibility and scalability to robust monitoring. Legacy workflows can be converted into DAG Factory YAML configurations that generate Airflow DAGs dynamically. This approach minimizes the need for extensive custom code and offers a scalable solution for managing and expanding DAGs as data needs evolve. Whether migrating dozens or hundreds of workflows, DAG Factory enables data teams to efficiently adopt Airflow’s modern orchestration capabilities, simplifying the journey to a more adaptable and observable ETL environment.

Ready to transform your legacy ETL workloads? Sign up for a free trial of Astro today to learn how we can help you scale your data engineering efforts with ease.
