Apache Airflow® for Data Engineers—How to Leverage Data Orchestration


At Astronomer we work directly with data engineers to help them solve their most difficult challenges. By understanding data engineering needs, we aim to deliver the most seamless experience with Apache Airflow® possible, so that teams can focus on the essential tasks that help make their organizations better, one DAG at a time.

Here, we dive into some specifics around the ways in which Apache Airflow® and Astronomer make life easier for data engineers everywhere.

What Is the Role of a Data Engineer?

Data engineers are responsible for designing and building data pipelines that make data available in a usable format to other data professionals, such as data analysts, data scientists, and data architects. They need to be able to quickly understand why a pipeline failed and how to fix it, and to build pipelines that adhere to their organization’s best practices. In smaller companies, data engineers also keep an eye on trends that may impact the business. The role requires both technical and communication skills.

The most common responsibilities of a data engineer include:

  • Working closely with a data architect on creating data architecture
  • Building and maintaining data frameworks and workflows
  • Gathering, storing, and preparing data in a format that data scientists and data analysts can use
  • Developing processes and automating tasks
  • Improving data reliability, efficiency, and quality
  • Preparing data for predictive and prescriptive modeling
  • Finding hidden patterns, changes, and errors in datasets

How Has the Role of the Data Engineer Changed Over the Years?

"Data engineer" as a job profile wasn't common in organizations until around 2015. Data ingestion was handled much differently than it is today, and data engineering tasks were performed predominantly by Python developers or developers using GUI-based tools like Informatica. Today, with companies focusing on data assets, flexible data ingestions and orchestration have become vital. This has led to the rise of a data engineer—someone who would be responsible for defining data flows.

The role, however, keeps evolving. A few years ago, in order to actually run their code, data engineers also had to act as data infrastructure engineers, setting up and operating both the data platforms and the core infrastructure underneath. The problem was that professionals who are good at writing Python and ingesting data are often not as good at managing infrastructure and Kubernetes. Companies were constantly searching for a kind of superhero: a data engineer who could do both.

Due to rapid changes in technology and evolving market needs, organizations realized the value of separating these skill sets, so they introduced dedicated roles such as DevOps and infrastructure engineers. With those roles on the team, data engineers can focus on just writing DAGs: defining how data should move from point A to point B in a way that brings real value to the business.

Nowadays, as more team members need access to job orchestration and scheduling—for example, data analysts who want to schedule data transformations and SQL queries, or machine learning engineers and data scientists who want to productionize their models—the demand for data workflows has increased exponentially. This is why today, we also have data platform engineers who focus on building frameworks—automating the way DAGs are created, making the work of data engineers (and other team members) easier.

What Are the Most Common Challenges Data Engineers Face?

There are three main challenges data engineers tend to face today:

Managing infrastructure

Data engineers are still required to know a lot about infrastructure, when they should be able to focus on creating data pipelines.

Lack of best practices

The data ecosystem is complex and growing larger every day. Data orchestration tools need to be extensible to accommodate this, but too much flexibility can be confusing. Community-defined best practices help ensure productivity and maintainability.

Testing

Finding the best way to test DAGs is a challenge, as there is so much to consider that is specific to the organization and the use case, especially once a DAG goes into production.

How Does Apache Airflow® Help Data Engineers?

Apache Airflow® is a data orchestration tool for programmatically authoring, scheduling, and monitoring workflows. The Airflow community is strong, healthy, and vibrant, with over 1,700 code contributors and counting since the first commit in 2014. The most recent version, Airflow 2.2, was released in the second half of 2021, combining two big new features and a whole lot of small quality-of-life improvements that make the tool even more powerful. Learn more about Airflow 2.2 here.
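In Airflow, a pipeline is expressed as a DAG of tasks written in Python. Here is a minimal sketch of a two-task DAG for Airflow 2.x; the DAG name and task logic are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _extract():
    # Placeholder extract step; a real pipeline would pull data from a source system.
    print("extracting data")


def _load():
    # Placeholder load step; a real pipeline would write data to a warehouse.
    print("loading data")


with DAG(
    dag_id="example_etl",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=_extract)
    load = PythonOperator(task_id="load", python_callable=_load)

    extract >> load  # load runs only after extract succeeds
```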

Here are some examples of how Apache Airflow® can help data engineers with some general use cases…

  1. Add alerting to data pipelines

Monitoring tasks and DAGs at scale can be burdensome. Airflow has an easy way to add notifications and alerts to a workflow. By implementing custom email, Slack, and Microsoft Teams notifications, data engineers can be confident they aren't missing critical events that may require immediate attention.
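As a sketch of what this can look like, the DAG below turns on Airflow's built-in email alerting and attaches a Slack failure callback. It assumes the Slack provider package is installed and a webhook connection named slack_default exists; the DAG name, email address, and bash command are placeholders, and the exact hook API can differ between Slack provider versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook


def notify_slack_on_failure(context):
    # Airflow calls this with the task context whenever a task in the DAG fails.
    # Assumes a Slack webhook connection named "slack_default"; the send() method
    # is available in recent versions of the Slack provider.
    ti = context["task_instance"]
    message = f"Task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}."
    SlackWebhookHook(slack_webhook_conn_id="slack_default").send(text=message)


with DAG(
    dag_id="pipeline_with_alerts",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "email": ["data-team@example.com"],  # hypothetical address
        "email_on_failure": True,  # built-in email alerting
        "on_failure_callback": notify_slack_on_failure,
    },
) as dag:
    BashOperator(task_id="transform", bash_command="echo 'transforming data'")
```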

  2. Perform a variety of data quality and integrity checks

Users can do this easily with Airflow and Great Expectations, for example by running an expectation suite against a sample dataset in BigQuery.
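A minimal sketch with the Great Expectations provider package is shown below. It assumes an existing Great Expectations project with a checkpoint that points at the BigQuery dataset; the project path, checkpoint name, and DAG name are all hypothetical, and argument names differ between provider versions.

```python
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="data_quality_checks",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs a pre-configured checkpoint (which bundles an expectation suite and a
    # batch of data, e.g. a BigQuery table); the task fails if any expectation fails.
    validate = GreatExpectationsOperator(
        task_id="validate_sample_table",
        data_context_root_dir="/usr/local/airflow/great_expectations",  # assumed project path
        checkpoint_name="sample_table_checkpoint",  # hypothetical checkpoint
    )
```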

…and some more specific ones:

  3. Move data from Zendesk to Snowflake

By using the Zendesk API and Airflow's built-in S3 and Snowflake operators, data engineers can implement standard patterns to extract three specific Zendesk objects (tickets, users, and organizations) into S3. From S3, those objects are then loaded into Snowflake.
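A simplified sketch of that pattern for the tickets object is below, assuming the Amazon and Snowflake provider packages are installed; the Zendesk URL, credentials, bucket, stage, table, and connection IDs are all placeholders.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator

# All URLs, credentials, bucket names, stages, and tables below are hypothetical.
ZENDESK_URL = "https://example.zendesk.com/api/v2/tickets.json"
S3_BUCKET = "zendesk-landing-bucket"
S3_KEY = "zendesk/tickets/{{ ds }}.json"


def extract_tickets_to_s3(**context):
    # Pull tickets from the Zendesk API and land the raw JSON in S3.
    response = requests.get(ZENDESK_URL, auth=("user@example.com/token", "API_TOKEN"))
    response.raise_for_status()
    S3Hook(aws_conn_id="aws_default").load_string(
        string_data=response.text,
        key=context["templates_dict"]["s3_key"],
        bucket_name=S3_BUCKET,
        replace=True,
    )


with DAG(
    dag_id="zendesk_to_snowflake",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_tickets",
        python_callable=extract_tickets_to_s3,
        templates_dict={"s3_key": S3_KEY},  # templated so each run writes a dated file
    )

    load = S3ToSnowflakeOperator(
        task_id="load_tickets",
        snowflake_conn_id="snowflake_default",
        s3_keys=["zendesk/tickets/{{ ds }}.json"],
        stage="zendesk_stage",  # external stage pointing at the S3 bucket
        table="ZENDESK_TICKETS",
        file_format="(type = 'JSON')",
    )

    extract >> load
```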

  4. Orchestrate Azure Data Factory pipelines

With Airflow, data engineers can easily interact with Azure Container Instances, Azure Data Explorer, Azure Data Factory, and Azure Blob Storage.
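For example, the Microsoft Azure provider package ships an operator that triggers a Data Factory pipeline run and optionally waits for it to finish. The sketch below assumes an existing Data Factory pipeline and an Airflow connection; all names are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    dag_id="adf_orchestration",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Triggers an existing Data Factory pipeline and waits for it to complete.
    # The connection, resource group, factory, and pipeline names are assumptions.
    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_adf_pipeline",
        azure_data_factory_conn_id="azure_data_factory_default",
        pipeline_name="copy_sales_data",
        resource_group_name="my-resource-group",
        factory_name="my-data-factory",
        wait_for_termination=True,
    )
```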

  5. Execute your Talend Jobs

Data engineers can easily integrate and use Talend together with Airflow for better data management. Using Airflow for orchestration allows for easily running multiple jobs with dependencies, parallelizing jobs, monitoring run status and failures, and more.
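One common approach, sketched below, is to run jobs exported from Talend Studio as standalone launcher scripts with the BashOperator; the paths and job names are assumptions, and Talend Cloud users could instead call the Talend Management Console API from a task.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="talend_jobs",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs the launcher script of a job exported from Talend Studio.
    # The trailing space stops Airflow from treating the ".sh" path as a Jinja template file.
    load_customers = BashOperator(
        task_id="load_customers",
        bash_command="/opt/talend/jobs/load_customers/load_customers_run.sh ",
    )
    load_orders = BashOperator(
        task_id="load_orders",
        bash_command="/opt/talend/jobs/load_orders/load_orders_run.sh ",
    )

    load_customers >> load_orders  # Airflow handles the dependency between the two jobs
```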

  6. Orchestrate ML models in Amazon SageMaker

By nature, working with ML models in production requires automation and orchestration for repeated model training, testing, evaluation, and likely integration with other services to acquire and prepare data. As it happens, Airflow is the perfect orchestrator for the job, as users can pair it easily with SageMaker.
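Below is a sketch of a training task using the Amazon provider's SageMakerTrainingOperator. The config dictionary follows the SageMaker CreateTrainingJob API; every ARN, image URI, bucket, and name is a placeholder, and the operator's import path varies with the provider version.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

# Minimal training-job config following the SageMaker CreateTrainingJob API.
# Every ARN, image URI, bucket, and name below is a placeholder. SageMaker requires
# a unique TrainingJobName per run, so in practice this would be templated or generated.
TRAINING_CONFIG = {
    "TrainingJobName": "churn-model-training",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-execution-role",
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/prepared/train/",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/models/"},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m5.large",
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

with DAG(
    dag_id="sagemaker_training",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    train_model = SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,
        aws_conn_id="aws_default",
        wait_for_completion=True,  # the task stays running until the SageMaker job finishes
    )
```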

  7. And many, many more! Head to the Astronomer Registry for more example DAGs and Airflow use cases.

Note: The Astronomer Registry is a community tool for data engineers, scientists, and analysts. It’s the easiest way to get started with your Airflow use case by providing you with the building blocks for your Apache Airflow® data pipelines, including DAGs, modules, and providers.

Common Airflow Challenges

Airflow, as a free, open-source tool, comes with its own challenges. For example, as we mentioned before, data engineers may need to run Airflow themselves, which requires deep infrastructure knowledge that is outside their core skill set. Second, even though Python is the most common language for data work, individual data engineers have different levels of expertise with it.

Another problem has to do with ownership and information silos: a data engineer may need to support pipelines written by colleagues who have since left the company. Additionally, the Airflow UI does not make it easy to see a list of the pipelines a given engineer is responsible for.

Third, even though Airflow offers pipeline alerts, understanding failures and knowing what to do about them can be difficult.

And finally, designing pipelines and making sure the underlying infrastructure can scale may be challenging without DevOps experience.

How Does Astronomer Make Airflow Better?

Astronomer is a managed Airflow service that allows data teams to build, run, and manage data pipelines as code at an enterprise scale. With Astronomer you can run Airflow anywhere you want (AWS, GCP, Azure, on-premise), with easier and faster setup and customer support from top industry experts.

Focus on writing pipelines, not managing Airflow. Get access to the latest upstream features and bug fixes. Significantly reduce dependency on DevOps with a managed service. Get best practices, case-by-case guidance, and knowledge straight from the Airflow community.

If you and your team are struggling with Airflow adoption, sign up for Astronomer Office Hours with one of our experts.
