Glossary
Use this glossary to quickly reference key terms, components, and concepts.
Term | Definition |
---|---|
Airflow connection | An Airflow connection is a set of configurations and credentials that allows Airflow to connect with external tools. |
Airflow UI | The Airflow UI is the primary visual interface for managing DAG and task runs. It contains pages for modifying, monitoring, and troubleshooting an Airflow environment. |
Airflow variable | An Airflow variable is a generic key-value pair, such as an API key or file path, that's stored in the Airflow metadata database and that you can reference in a DAG. |
Apache Airflow | Apache Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring data pipelines written in Python. Airflow is scalable, configurable, and the industry standard for managing workflows across your ecosystem. |
Data orchestration | Data orchestration is the automated configuration, scheduling, and management of sequential, interdependent tasks involving data. The complexity of modern data pipelines is reflected in the architecture and feature sets of orchestrators, which should not be confused with simple schedulers. In addition to scheduling, orchestration involves the handling of errors and dependencies between tasks, among other aspects of data pipeline management. |
Data product | In the context of observability, a data product is a data asset monitored by an observability tool. For example, a warehouse table, data lake bucket, local database table, or local file containing business-critical customer data could be a data product. |
Dataset | A dataset is a logical grouping of data consumed or produced by tasks in an Airflow DAG. It can be a table, a file, a blob, or a dataframe. Datasets can be used to schedule DAGs with dataset-driven scheduling. |
Decorator | In Python, decorators are functions that take another function as an argument and extend the behavior of that function. In Airflow, decorators provide a simpler way to define Airflow tasks and DAGs compared to traditional operators. |
Deferrable operator | A deferrable operator, also known as an async operator, is an operator that suspends itself while waiting for its condition to be met and resumes on receiving the job status. Tasks that use deferrable operators consume resources more efficiently than sensors because they do not occupy a worker slot when they are in a deferred state. Instead, deferred tasks use the triggerer to poll for job statuses. |
Docker image | An Airflow Docker image is a template used to build containers with Podman or Docker, which run Airflow components and execute DAG code. Both Apache Airflow and Astronomer distribute Docker images for Airflow with different build instructions and pre-installed packages. |
Dynamic DAG | A dynamic DAG is a DAG that is generated automatically when the scheduler parses the dags folder. You can dynamically create DAGs based on code in one or more Python files or by using tools like gusty or dag-factory. |
Dynamic task | A dynamic task is a task instance that's generated at runtime based on a set of parameters in DAG code. Dynamic task mapping, the Airflow feature that creates dynamic tasks, allows users to create an arbitrary number of parallel tasks at runtime based on an input parameter. |
Environment variable | An environment variable is a key-value pair that can be used to define an Airflow environment configuration. Environment variables named in the format AIRFLOW__{SECTION}__{KEY} override the corresponding settings in the airflow.cfg file. |
Executor | An executor is a core process within the Airflow scheduler that is responsible for assigning scheduled tasks to a worker process that will complete the function of a task. Airflow supports multiple executors that differ based on the types of workers they use. |
Hook | A hook is an abstraction of a specific API that allows Airflow to interact with an external system. Hooks are built into many operators, but they can also be used directly in DAG code. |
Jinja template | Jinja templating is a format used to pass dynamic information into task instances at runtime. A Jinja-templated value is enclosed in double curly braces, such as {{ ds }} for the logical date of the DAG run. |
Lineage | In the context of observability, lineage refers to the collecting of data product metadata and task metadata in pipelines, typically in real time. Lineage consumers use this metadata to represent the movement of data through pipelines as a dynamic map of the relationships between tasks, data products, and systems. Lineage enables the visualization of the upstream and downstream dependencies of business-critical data products, making it easy to identify the tasks and data products that feed them. Lineage also facilitates SLA evaluation, alerting, tagging, and root-cause analysis of bottlenecks, data latency, and staleness. |
Lineage graph | A lineage graph is a map of the relationships between tasks, data products, and systems in a data ecosystem. Lineage graphs enable the visualization of upstream and downstream dependencies, making it easy to trace the path taken by data in critical data products back to their originating tasks and data sources. |
Notifier | A notifier is a class used to send notifications to tools like Slack or PagerDuty. Some provider packages include pre-built notifiers, and you can also write your own. |
Observability | Observability refers to the insights gleaned from collecting, visualizing, and analyzing metadata about data products and tasks, typically at runtime. The metadata and analytics available from observability include upstream and downstream dependencies (jobs as well as data products), SLA evaluations, alerts, data freshness and quality metrics, data product ownership information, job duration, and job status. |
Operator | Operators are the building blocks of Airflow DAGs. An operator contains the logic of how data is processed in a pipeline. Each task in a DAG is defined by instantiating an operator. |
Provider | An Airflow provider is a Python package that can be added to core Airflow to extend its functionality. A provider package typically contains modules such as operators, hooks, and sensors to interact with an external service. You can add providers to your Airflow environment by adding their package names to the requirements.txt file of your Astro project. For a list of available providers, see the Astronomer registry. |
Scheduler | The scheduler is the Airflow component responsible for scheduling DAG runs and task instances. It is a multi-threaded Python process that determines what tasks need to be run, when they need to be run, and where they are run. |
Sensor | An Airflow sensor is a special kind of operator that is designed to wait for something to happen. When sensors run, they check whether a certain condition is met before they are marked successful and let their downstream tasks execute. |
Service Level Agreement (SLA) | An SLA is a commitment about the level of service that data consumers can expect. Organizations use SLAs to help ensure the efficient, timely, and reliable delivery of data. SLAs can specify: an expected timeframe in which data should be processed and made available; bounds for pipeline latency, uptime, errors, and throughput; recovery time and recovery point objectives; expected notification and resolution times; and more. |
Service Level Indicator (SLI) | SLIs are quantifiable measures of data quality. Common SLIs of importance to both data teams and data consumers are data freshness and on-time delivery. |
Service Level Objective (SLO) | SLOs are performance targets for Service Level Indicators (SLIs), such as on-time delivery rates and freshness rates. |
Tag | In the context of data observability, tags are custom labels that data practitioners and admins can attach to data products and tasks to aid in metadata discovery across pipelines. A common use case for tags is monitoring PII access. |
Task | A task is the basic unit of execution in Airflow. Tasks are arranged into DAGs, and then have upstream and downstream dependencies set between them to express the order in which they should run. |
Task dependency | A task dependency is an instruction that defines whether a task must be completed either before or after another task in the same DAG. Task dependencies are defined in DAG code either explicitly with bitshift operators or implicitly with the TaskFlow API. |
Task group | A task group is a way to visually organize a group of tasks in the Airflow UI. Task groups are defined in DAG code and render as groupings in the Graph view of the Airflow UI. |
TaskFlow API | The TaskFlow API is a framework for using decorators to define DAGs and tasks. Compared to using traditional operators, using the TaskFlow API simplifies the process for passing data between tasks and defining dependencies. |
Triggerer | The triggerer is an optional Airflow component responsible for running deferrable operators when they're in a deferred state. |
Webserver | The webserver is the Airflow component that serves the Airflow UI. It is a Flask server running with Gunicorn. |
XCom | XCom is an Airflow feature that allows you to exchange task metadata or small amounts of data between tasks. XComs are defined by a key, value, and timestamp. |
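
The Decorator and Task dependency entries above describe two Python mechanisms: a function that wraps another function, and bitshift operators that express run order. The sketch below illustrates both in plain Python without importing Airflow. `ToyTask`, `toy_task`, and `run_order` are hypothetical names invented for this illustration and are not part of Airflow's API; the real implementations are considerably more involved.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ToyTask:
    """Hypothetical stand-in for a task; not Airflow's implementation."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []  # tasks that must run after this one

    def __rshift__(self, other):
        # `a >> b` records that b runs after a; returning b allows chaining.
        self.downstream.append(other)
        return other

    def __lshift__(self, other):
        # `a << b` records that a runs after b; returning b allows chaining.
        other.downstream.append(self)
        return other


def toy_task(func):
    """Decorator: takes a function and returns it wrapped as a ToyTask."""
    return ToyTask(func.__name__, func)


@toy_task
def extract():
    return [1, 2, 3]


@toy_task
def transform():
    return "transformed"


@toy_task
def load():
    return "loaded"


# Explicit dependencies with bitshift operators: extract, then transform, then load.
extract >> transform >> load


def run_order(tasks):
    """Return task names in an order that respects the recorded dependencies."""
    graph = {t.name: set() for t in tasks}
    for t in tasks:
        for d in t.downstream:
            # Map each downstream task to the set of tasks it waits on.
            graph.setdefault(d.name, set()).add(t.name)
    return list(TopologicalSorter(graph).static_order())


print(run_order([load, extract, transform]))
```

In Airflow itself, the `task` decorator from `airflow.decorators` and the `>>` and `<<` operators on tasks play the corresponding roles; the toy classes here only mimic the shape of that syntax.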