
Data lineage overview and concepts

Data lineage is the practice of tracking and observing data as it flows through a data pipeline. With Astro Observe, you can use data lineage to understand data sources, troubleshoot run failures, manage personally identifiable information (PII), and support compliance with data regulations.

Lineage on Astronomer

In the Astro UI, the Observe section allows you to view the different Data Products and Assets (DAGs, tasks, and datasets) that your Organization has created. When you view a specific Data Product, you can click Open Graph to render the lineage metadata generated by your DAGs as a dynamic graph. For more information on using the lineage graph, see Leveraging Data Products.

Astro uses the OpenLineage open source standard to emit lineage metadata. OpenLineage provides a formal specification for data lineage: it standardizes the definition of lineage, the metadata that describes it, and the approach for collecting that metadata from external systems.

Core concepts

The following terms are used frequently when discussing data lineage and OpenLineage with Astro.

  • Integration: A means of gathering lineage metadata from a source system such as a scheduler or data platform. For example, the OpenLineage Airflow integration allows lineage metadata to be collected from Airflow DAGs. See OpenLineage documentation for a complete list of OpenLineage integrations.
  • Extractor: In the openlineage-airflow package, an extractor is a module that gathers lineage metadata from a specific hook or operator. For example, extractors exist for the PostgresOperator and SnowflakeOperator, meaning that if openlineage-airflow is installed and configured for your Airflow environment, then lineage metadata is generated automatically from those operators when your DAG runs. An extractor must exist for a specific operator to get lineage metadata from it.
  • Run: A process that consumes or produces datasets. In the context of Airflow, an OpenLineage run corresponds to a task in your DAG as long as your task is an instance of an operator with an extractor. Runs can also represent work completed in other applications that emit lineage metadata, such as a Spark job or a dbt model. Runs appear as nodes on your lineage graphs in the graph view of your Data Product on Astro Observe.
  • Dataset: Any collection of data that your runs interact with. For example, a dataset can correspond to a table in your database or a set of data that you run a Great Expectations check on. A dataset is typically registered as part of your lineage metadata when a run that writes to it completes, for example, when data is inserted into a table.
  • Facet: A piece of lineage metadata about a job, dataset, or run. A facet that describes a job is known as a “job facet”.
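
To make the terms above more concrete, the following is a minimal sketch of how a run, its input and output datasets, and their facets might fit together in a single lineage event. It builds a plain dictionary loosely modeled on the shape of an OpenLineage run event and is simplified (for example, it leaves out the schema URL that a real event carries). It does not use the OpenLineage client library, and the helper `build_run_event`, the namespace, the producer URL, and all job and dataset names are hypothetical values chosen for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example_namespace"):
    """Assemble a dict loosely shaped like an OpenLineage run event (illustrative only)."""

    def dataset(name):
        # A dataset is identified by a namespace and a name; facets carry extra metadata.
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": event_type,  # for example "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",  # assumed placeholder
        "run": {"runId": str(uuid.uuid4()), "facets": {}},  # one execution of the job
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        "inputs": [dataset(n) for n in inputs],  # datasets the run consumes
        "outputs": [dataset(n) for n in outputs],  # datasets the run produces
    }


# Hypothetical Airflow task that reads one table and writes another.
event = build_run_event(
    "COMPLETE",
    job_name="example_dag.load_orders",
    inputs=["public.raw_orders"],
    outputs=["public.orders"],
)
print(json.dumps(event, indent=2))
```

In this sketch, the output dataset `public.orders` would be registered when the `COMPLETE` event for the run is emitted, which mirrors the description of datasets above. With an extractor in place, events of this kind are generated for you, so you would not normally build them by hand.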

OpenLineage and Airflow

Using OpenLineage with Airflow gives you insight into your complex data ecosystems and can lead to better data governance. Airflow is a natural place to integrate data lineage because it touches and moves data across many parts of an Organization.

Integrating OpenLineage with Airflow offers several key benefits:

  • Facilitates swift recovery from complex failures. By quickly identifying problems and pinpointing the affected areas, you can resolve issues faster and prevent erroneous decisions based on bad data.
  • Enhances collaboration across teams in your organization. Visualizing the complete lifecycle and usage of datasets streamlines analysis, saving valuable time and effort.
  • Supports compliance with data regulations. A clear view of where and how your data is used helps you adhere to regulatory requirements.
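
As an illustration of how lineage metadata helps pinpoint the areas affected by a failure, the sketch below walks a small lineage graph to find every dataset downstream of a problem dataset. This is not how Astro Observe computes its graph. The `RUNS` mapping, the dataset names, and the `downstream_datasets` helper are all hypothetical, and the traversal is an ordinary breadth-first search.

```python
from collections import deque

# Hypothetical lineage metadata: each run lists the datasets it reads and writes.
RUNS = {
    "extract_orders": {"inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    "build_revenue": {"inputs": ["staging.orders"], "outputs": ["marts.revenue"]},
    "build_customers": {"inputs": ["raw.customers"], "outputs": ["marts.customers"]},
}


def downstream_datasets(runs, start):
    """Return every dataset that depends, directly or indirectly, on `start`."""
    affected = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for run in runs.values():
            # A run that reads the current dataset passes the problem to its outputs.
            if current in run["inputs"]:
                for out in run["outputs"]:
                    if out not in affected:
                        affected.add(out)
                        queue.append(out)
    return affected


print(sorted(downstream_datasets(RUNS, "raw.orders")))
# → ['marts.revenue', 'staging.orders']
```

Here a problem in `raw.orders` affects `staging.orders` and `marts.revenue`, while `marts.customers` is untouched because no run connects it to the problem dataset.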
