Apache Airflow® Quickstart - Generative AI
Generative AI: An introduction to generative AI model development with Airflow.
Step 1: Clone the Astronomer Quickstart repository
- Create a new directory for your project and open it:

      mkdir airflow-quickstart-genai && cd airflow-quickstart-genai

- Clone the repository and open it:

      git clone -b generative-ai --single-branch https://github.com/astronomer/airflow-quickstart.git && cd airflow-quickstart/generative-ai
Your directory should have the following structure:
.
├── Dockerfile
├── README.md
├── airflow_settings.yaml
├── dags
│   └── example_vector_embeddings.py
├── include
│   ├── custom_functions
│   │   └── embedding_func.py
│   └── data
│       └── galaxy_names.txt
├── packages.txt
├── requirements.txt
├── solutions
│   └── example_vector_embeddings_solution.py
└── tests
    └── dags
        └── test_dag_integrity.py
Step 2: Start up Airflow and explore the UI
- Start the project using the Astro CLI:

      astro dev start

  The CLI will let you know when all Airflow services are up and running.

- If your browser doesn't launch automatically, navigate to `localhost:8080` and sign in to the Airflow UI using username `admin` and password `admin`.

- Explore the DAGs view (landing page) and individual DAG view page to get a sense of the metadata available about the DAG, run, and all task instances. For a deep-dive into the UI's features, see An introduction to the Airflow UI.
For example, the DAGs view will look like this screenshot:
As you start to trigger DAG runs, the graph view will look like this screenshot:
The Gantt chart will look like this screenshot:
Step 3: Explore the project
Apache Airflow is one of the most common orchestration engines for AI/Machine Learning jobs, especially for retrieval-augmented generation (RAG). This project shows a simple example of building vector embeddings for text and then performing a semantic search on the embeddings.
The DAG (directed acyclic graph) in the project demonstrates how to leverage Airflow's automation and orchestration capabilities to:
- Orchestrate a generative AI pipeline.
- Compute vector embeddings of words using Python's `SentenceTransformers` library.
- Compare the embeddings of a word of interest to a list of words to find the semantically closest match.
You'll write a user-customizable generative-AI pipeline in easy-to-read Python code!
This project uses DuckDB, an in-memory database, for storing the words and their vector embeddings. Although this type of database is great for learning Airflow, your data is not guaranteed to persist between executions!
For production applications, use a persistent database instead (consider DuckDB's hosted option MotherDuck or another database like Postgres, MySQL, or Snowflake).
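The core idea behind the project's semantic search can be shown without any model or database at all: compare vectors by cosine similarity and pick the nearest neighbor. Here is a minimal pure-Python sketch using made-up 3-dimensional vectors (the word list and numbers are illustrative only; a real embedding model produces vectors with hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-dimensional "embeddings" (made up for illustration); a model such
# as all-MiniLM-L6-v2 would produce 384-dimensional vectors instead.
embeddings = {
    "sun":    [0.9, 0.1, 0.0],
    "planet": [0.7, 0.4, 0.2],
    "rocket": [0.1, 0.9, 0.2],
}
query = [0.85, 0.2, 0.05]  # stand-in for the embedding of the word of interest

# Rank the stored words by similarity to the query and take the best match.
ranked = sorted(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]), reverse=True)
print(ranked[0])  # sun
```

The project's DAG performs the same comparison, but with real model embeddings stored in DuckDB rather than a Python dictionary.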
Pipeline structure
An Airflow project can have any number of DAGs (directed acyclic graphs), the main building blocks of Airflow pipelines. This project has one, `example_vector_embeddings`, which contains six tasks:
- `get_words`: gets a list of words from the context to embed.
- `create_embeddings`: creates embeddings for the list of words.
- `create_vector_table`: creates a table in the DuckDB database and an HNSW index on the embedding vector.
- `insert_words_into_db`: inserts the words and embeddings into the table.
- `embed_word`: embeds a single word and returns the embeddings.
- `find_closest_word_match`: finds the closest match to a word of interest.
Step 4: Get your hands dirty!
With Airflow, it's easy to test and compare LMs until you find the right model for your generative AI workflows. In this step, you'll learn how to:
- Configure a DAG to use different LMs.
- Use the Airflow UI to compare the performance of the models you select.
Experiment with different LMs to compare performance
Sentence Transformers (AKA SBERT) is a popular Python module for accessing, using, and training text and image embedding models. It enables a wide range of AI applications, including semantic search, semantic textual similarity, and paraphrase mining. SBERT provides various pre-trained language models via the Sentence Transformers Hugging Face organization. Additionally, over 6,000 community Sentence Transformers models have been publicly released on the Hugging Face Hub.
Try using a different language model from among those provided by SBERT in this project's DAG. Then, explore the metadata in the Airflow UI to compare the performance of the models.
- Start your experiment by using a different model. Find the `_LM` variable definition in the `get_embeddings_one_word` function close to the top of the `example_vector_embeddings` DAG and replace the model string with `distiluse-base-multilingual-cased-v2`:

      _LM = os.getenv("LM", "distiluse-base-multilingual-cased-v2")

  The default is very fast, but this one is slower and lower-performing overall, so the results should be noticeably different. You could also try a model with higher overall performance, such as `all-mpnet-base-v2`. For a list of possible models to choose from, see SBERT's Pretrained models list.

- Next, find the dimensions of the model in the SBERT docs. For example, the `distiluse-base-multilingual-cased-v2` model has dimensions of 512.

- Use this number to redefine another top-level variable, `_LM_DIMENSIONS`:

      _LM_DIMENSIONS = os.getenv("LM_DIMS", "512")

  This value is used in the vector column type definition in the `create_vector_table` task and the select query in the `find_closest_word_match` task.

- Rerun the DAG. Depending on the models you choose, you might see large differences in the performance of the `create_embeddings` task. For example, using the default `all-MiniLM-L6-v2` model should result in runtimes of around 4s, while the `distiluse-base-multilingual-cased-v2` model might result in runtimes three times as long or longer.

- Check the log output from the `find_closest_word_match` task and look for differences between the search result sets. For example, the faster LM `all-MiniLM-L6-v2` returns `sun, planet, light`, while the more performant LM `all-mpnet-base-v2` returns `sun, rocket, planet`.
For more information about the SBERT project, library, and models, see the Sentence Transformers Documentation.
Next Steps: Run Airflow on Astro
The easiest way to run Airflow in production is with Astro. To get started, create an Astro trial. During your trial signup, you will have the option of choosing the same template project you worked with in this quickstart.