Note: For more information, check out the How to Improve Data Quality with Airflow’s Great Expectations Operator webinar and our Orchestrate Great Expectations with Airflow tutorial.
1. About Great Expectations
Great Expectations is a shared, open standard for data quality that helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
With Great Expectations, you can express expectations about your data, for example that a column contains no nulls or that a table has twelve columns.
You can then test those expectations against your data and take action based on the results of those tests.
This tool lets you both use your tests as a form of documentation and keep your documentation in the form of tests, so that implicit assumptions about your data can be made explicit and shared across your organization.
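For example, the expectations mentioned above can be expressed directly in Python. The sketch below uses the library’s pandas interface (ge.from_pandas); the DataFrame and column names are invented purely for illustration.

```python
# A minimal sketch of the two expectations above, using the library's pandas
# interface; the DataFrame and column names are invented for illustration.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, None]})
dataset = ge.from_pandas(df)

# Expect the key column to contain no nulls.
print(dataset.expect_column_values_to_not_be_null("order_id").success)  # True

# Expect the table to have a fixed number of columns (two in this toy example).
print(dataset.expect_table_column_count_to_equal(2).success)  # True
```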
2. Why Airflow + Great Expectations?
There’s a strong, documented story of using Airflow and Great Expectations together. It’s an excellent way to add data quality checks to your organization’s data ecosystem.
- They are very different tools that complement each other
- Easily add data quality to existing DAGs
- Reliably schedule data quality checks
- Take use case–specific action when tests fail
- Observe outputs of checks with Data Docs (a human-readable form of documentation)
Use case: If you have a transformation pipeline, you can run Great Expectations at the start of it, before transforming your data, to make sure that everything loaded correctly; or you can run it after your transformations, to confirm that they succeeded.
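As a rough sketch (not code from the webinar), such a DAG could look like the following. The checkpoint names, data context path, and transform logic are placeholders, and the operator parameters assumed here are data_context_root_dir and checkpoint_name from the checkpoint-based operator.

```python
# A sketch of the transformation-pipeline use case: validate the raw load,
# transform, then validate the result. Checkpoint names, the data context
# path, and the transform logic are placeholders, not values from the webinar.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)


def transform(**context):
    ...  # your transformation logic goes here


with DAG(
    dag_id="data_quality_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Check that the raw data loaded correctly before transforming it.
    validate_load = GreatExpectationsOperator(
        task_id="validate_load",
        data_context_root_dir="/usr/local/airflow/great_expectations",
        checkpoint_name="raw_orders_checkpoint",
    )

    transform_data = PythonOperator(task_id="transform_data", python_callable=transform)

    # Confirm that the transformations succeeded.
    validate_transform = GreatExpectationsOperator(
        task_id="validate_transform",
        data_context_root_dir="/usr/local/airflow/great_expectations",
        checkpoint_name="transformed_orders_checkpoint",
    )

    validate_load >> transform_data >> validate_transform
```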
3. Great Expectations Vocabulary Cheat Sheet
Datasources: A Datasource is a configuration that describes where data is, how to connect to it, and which execution engine to use when running a Checkpoint.
Expectations: Expectations, stored within Expectation Suites, provide a flexible, declarative language for describing expected behavior and verifiable properties of data.
Data Context: A Data Context represents a Great Expectations project, organizing storage and access for Expectation Suites, Datasources, notification settings, and data fixtures.
Batch Request: A Batch Request defines a batch of data to run expectations on from a given data source; it can describe data in a file, database, or dataframe.
Checkpoint Config: Checkpoint Configs describe which Expectation Suite should be run against which data, and what actions to take during and after a run.
Checkpoints: Checkpoints provide an abstraction for bundling the validation of a batch of data against an Expectation Suite and the actions that should be taken after the validation.
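To show how these pieces fit together, here is a minimal sketch in which every name and path is a placeholder: the Data Context loads the project, the Batch Request points at a batch of data, and the Checkpoint validates it against an Expectation Suite.

```python
# A minimal sketch tying the vocabulary together; every name and path below is
# a placeholder, not a value from the webinar.
from great_expectations.core.batch import BatchRequest
from great_expectations.data_context import DataContext

# The Data Context represents the Great Expectations project on disk.
context = DataContext("/usr/local/airflow/great_expectations")

# The Batch Request describes which batch of data to validate.
batch_request = BatchRequest(
    datasource_name="orders_datasource",
    data_connector_name="default_inferred_data_connector_name",
    data_asset_name="orders.csv",
)

# Run a configured Checkpoint against that batch with a given Expectation Suite.
result = context.run_checkpoint(
    checkpoint_name="orders_checkpoint",
    validations=[
        {"batch_request": batch_request, "expectation_suite_name": "orders_suite"}
    ],
)
print(result.success)
```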
4. Great Expectations Operator v0.1x - V3 API Upgrade
What changed from V2 to V3?
- V3 makes it much easier to write custom expectations and to extend existing ones. Previously, expectations were all bundled into one file, which made for a more complicated setup; now each expectation gets its own file (see the sketch after this list).
- It’s now possible to work with multiple batches of data.
- Profiling is improved.
- V3 Checkpoints provide a more robust approach to validating your data and taking action based on the results, and values can now be passed dynamically at runtime.
- Users can lean much more heavily on the core library and its functionality.
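As a rough illustration of that per-file pattern, a custom expectation in the V3 API might look like the following sketch; the expectation itself (all column values must be positive) is invented for this example.

```python
# A rough sketch of a V3-style custom expectation living in its own file.
# The expectation (all column values must be positive) is invented here
# purely to illustrate the pattern.
from great_expectations.execution_engine import PandasExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


class ColumnValuesArePositive(ColumnMapMetricProvider):
    # The metric this provider computes; the expectation below references it by name.
    condition_metric_name = "column_values.positive"

    @column_condition_partial(engine=PandasExecutionEngine)
    def _pandas(cls, column, **kwargs):
        # Row-level condition: True where a value passes, False where it fails.
        return column > 0


class ExpectColumnValuesToBePositive(ColumnMapExpectation):
    """Expect every value in the given column to be greater than zero."""

    map_metric = "column_values.positive"
```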
V3 upgrade includes:
Checkpoint Model: The new operator takes a simpler approach to running Great Expectations suites by only running checkpoints.
Data Sources: Any Great Expectations-compatible Datasource can be added to a Data Context and run with the operator.
Configurations: Default checkpoint values can be overwritten per-checkpoint at runtime with checkpoint kwargs.
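For example (a hedged sketch; the checkpoint name, data context path, and the overridden suite name are all placeholders):

```python
# A sketch of overriding stored checkpoint defaults at runtime. The names and
# path are placeholders; anything passed in checkpoint_kwargs overrides the
# corresponding value in the checkpoint config for this run only.
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

validate_with_override = GreatExpectationsOperator(
    task_id="validate_with_override",
    data_context_root_dir="/usr/local/airflow/great_expectations",
    checkpoint_name="orders_checkpoint",
    # For this run, validate against a stricter expectation suite than the
    # one stored in the checkpoint config.
    checkpoint_kwargs={"expectation_suite_name": "orders_suite_strict"},
)
```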
5. Demo
Discussed and presented:
Write-Audit-Publish: The “why” use case for this sort of DAG: what happens when things fail, who gets mad, and how to prevent that pain (a rough sketch of the pattern follows below).
MLflow: A use case for machine learning enthusiasts: what they would want to protect their data and machine learning models from, and why.
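Here is a rough sketch of the write-audit-publish pattern as an Airflow DAG, with placeholder task logic, paths, and checkpoint names (none of which come from the webinar itself):

```python
# A rough sketch of a write-audit-publish DAG: land data in staging, audit it
# with a Great Expectations checkpoint, and only publish if the audit passes.
# Task logic, the data context path, and the checkpoint name are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)


def write_to_staging(**context):
    ...  # load new data into a staging table


def publish_to_production(**context):
    ...  # promote the audited data to the production table


with DAG(
    dag_id="write_audit_publish",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    write = PythonOperator(task_id="write", python_callable=write_to_staging)

    # The audit step: if validation fails, the task fails and publish never runs.
    audit = GreatExpectationsOperator(
        task_id="audit",
        data_context_root_dir="/usr/local/airflow/great_expectations",
        checkpoint_name="staging_orders_checkpoint",
        fail_task_on_validation_failure=True,
    )

    publish = PythonOperator(task_id="publish", python_callable=publish_to_production)

    write >> audit >> publish
```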
See more examples in the video and find all the code in this GitHub repo.
And don’t miss the Q&A!