OpenLineage
An open framework for data lineage and observability.
OpenLineage makes data pipelines observable by automatically collecting and correlating detailed information about their operation and the movement of data within them.
Using OpenLineage, organizations can build observability into their data ecosystem. Integrating with Apache Airflow® and other orchestration platforms, OpenLineage monitors the creation, movement, and transformation of datasets. The information collected by OpenLineage is essential for managing today’s distributed data environments, enabling engineers to quickly find, fix, and prevent complex operational issues.
Astro uses OpenLineage to trace the end-to-end journey of each dataset, consolidate quality metrics, and surface operational issues, helping to ensure the availability and trustworthiness of business-critical data.
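To make the idea of "collected information" concrete, here is a minimal sketch of what a single lineage event might look like when a job finishes. It uses only the Python standard library, not an OpenLineage client. The field names follow the general shape of an OpenLineage run event (event type, event time, run, job, inputs, outputs, producer), but the event is simplified and omits fields the full specification expects, such as the schema URL. The helper `build_run_event`, the namespace, the job and dataset names, and the producer URL are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, namespace="example_namespace"):
    """Assemble a simplified lineage event as a plain dict (illustrative only)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},  # unique ID for this execution
        "job": {"namespace": namespace, "name": job_name},
        # Datasets the job read from and wrote to
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical identifier for the code that emitted the event
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    "COMPLETE",
    "daily_orders_load",
    inputs=["raw.orders"],
    outputs=["analytics.orders"],
)
print(json.dumps(event, indent=2))
```

Correlating many such events, each naming a job together with its input and output datasets, is what allows a lineage graph to be assembled across runs.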
OpenLineage Design Principles
OpenLineage was created by data engineers who know what happens when a pipeline becomes too large for a single brain to comprehend, and is collaboratively developed by an open community that shares a common vision for the project. It is designed to be:
Operational
Navigating a large and complex pipeline becomes a whole lot easier once you have an up-to-date lineage graph that shows the flow of data among its various jobs, datasets, and systems. OpenLineage integrates directly with orchestration platforms and warehouses, observing data relationships as they are formed and ensuring teams can quickly determine the scope of emerging issues and respond with accuracy.
Platform agnostic
The collection of tools and platforms available to data engineers continues to grow, allowing them to do increasingly imaginative and powerful things with the data they collect. Lineage is the thread that ties it all together. OpenLineage provides a lingua franca for observation that spans today’s tools and — thanks to its flexibility — tomorrow’s as well.
Extensible
OpenLineage is an open specification managed by an open community. Extensions can be created independently, and can easily become part of the core specification. The community is made up of members with diverse employers, skill sets, and locations, and is always looking for new voices and fresh ideas.
To learn more, check out this webinar, which covers the basics of OpenLineage and shows how it works together with Airflow to collect data lineage as tasks run.
OpenLineage is built on a strong and growing community of contributors.
Use Cases
The information collected by OpenLineage can be used for:
- OpenLineage for Data Observability
  Quickly learn when a job has failed, a dataset is stale, or the shape of the data has unexpectedly changed. Create a map of your data ecosystem that makes it easy to understand and communicate complex operational situations.
- Root Cause Analysis
  Use data lineage to correlate and contextualize issues that span multiple platforms, quickly separating causes from symptoms and reducing the time it takes to bring critical datasets back online after a failure.
- Impact Planning
  Plan changes armed with a complete view of downstream jobs and datasets, assessing their impact across the entire data ecosystem and taking steps to ensure that they are carried out with minimal risk.
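As a rough illustration of impact planning, the sketch below walks a small lineage graph to list everything downstream of a dataset that is about to change. The graph, the job and dataset names, and the `downstream` helper are hypothetical; in practice the edges would be derived from collected lineage metadata rather than written by hand.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed in its value.
LINEAGE = {
    "raw.orders": ["job.clean_orders"],
    "job.clean_orders": ["analytics.orders"],
    "analytics.orders": ["job.daily_revenue", "job.customer_ltv"],
    "job.daily_revenue": ["reports.revenue"],
    "job.customer_ltv": ["reports.ltv"],
}


def downstream(graph, start):
    """Return every job and dataset reachable from `start`, nearest first."""
    seen = {start}  # guards against revisiting nodes if the graph has cycles
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream(LINEAGE, "analytics.orders")
print(impacted)
# ['job.daily_revenue', 'job.customer_ltv', 'reports.revenue', 'reports.ltv']
```

Running the same traversal over a graph with its edges reversed would list upstream nodes instead, which is the direction a root cause investigation typically follows.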