Hybrid Search for eCommerce reference architecture

The Hybrid Search for eCommerce repository is a free and open-source reference architecture showing how to use Apache Airflow® with Weaviate to build an automated hybrid search application. A demo of the architecture was shown in the Modern Infrastructure for World Class AI Applications webinar.

Screenshot of the Hybrid Search application frontend.

This reference architecture demonstrates how to use Apache Airflow to orchestrate the retrieval-augmented generation (RAG) data ingestion that powers a search application, as well as a batch inference pipeline that analyzes user search queries. It also shows how to use Weaviate's advanced search capabilities. You can adapt the Hybrid Search application to your use case by ingesting your own data and adjusting the search queries in the website backend to fit your needs.

Architecture

Hybrid search reference architecture diagram.

The hybrid search reference architecture consists of three main components:

  • Data ingestion and embedding: Sample data containing product descriptions and images is ingested from Amazon S3 and Snowflake into Weaviate, a vector database. The product descriptions are embedded using OpenAI models.
  • Hybrid search: The demo website, with a Flask backend and React frontend, lets users experiment with advanced Weaviate search by querying the product descriptions using hybrid search. An OpenAI embedding model embeds the user query (see the query sketch after this list).
  • Batch inference: All user search queries are stored back in Weaviate so they can be used by a downstream Airflow DAG that runs an OpenAI batch inference pipeline to classify user queries and derive product insights. The results of this analysis are loaded into Snowflake to be displayed in a Streamlit dashboard.
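
To illustrate the kind of query the Flask backend might issue, here is a minimal hybrid search sketch using the Weaviate Python client (v4). The local connection, the Products collection name, and the query text are assumptions for illustration; the collection is assumed to be configured with an OpenAI vectorizer (such as text2vec-openai) so that Weaviate embeds the query server-side.

```python
import weaviate

# Connect to a local Weaviate instance (connection details are assumptions).
client = weaviate.connect_to_local()

# "Products" is a hypothetical collection name.
products = client.collections.get("Products")

# Hybrid search blends BM25 keyword scoring with vector similarity.
# alpha=0.5 weights both equally; alpha=1.0 is pure vector search and
# alpha=0.0 is pure keyword search.
response = products.query.hybrid(
    query="waterproof hiking boots",
    alpha=0.5,
    limit=5,
)

for obj in response.objects:
    print(obj.properties)

client.close()
```

Tuning alpha is the main lever for trading off semantic recall against exact keyword matching, which is useful for product searches that mix brand names with descriptive terms.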

Airflow features

The DAGs that power this hybrid search application highlight several key Airflow best practices and features:

  • Airflow retries: To protect against transient API failures and rate limits, all tasks are configured to automatically retry after an adjustable delay.
  • Advanced data-driven scheduling: The DAGs in this reference architecture run on data-driven schedules, including combined dataset and time scheduling and conditional dataset scheduling.
  • Dynamic task mapping: Product information extraction and ingestion into Weaviate are split into parallel mapped tasks; the number of task instances is determined at runtime from the number of ingestion folders that need to be processed (see the sketch after this list).
  • Object Storage: Interaction with files in object storage is simplified using the experimental Airflow Object Storage API.
  • Modularization: Functions that define how information is extracted and how checksums are calculated are modularized in the include folder and imported into the DAGs. This keeps the DAG code readable and allows functions to be reused across multiple DAGs.
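
As a minimal sketch of how several of these features fit together (not the repository's actual DAG code), the following DAG combines a dataset-driven schedule, task retries, the Object Storage API, and dynamic task mapping. The dataset URI, S3 path, and connection ID are assumptions for illustration.

```python
from datetime import timedelta

import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

# Hypothetical dataset URI and S3 path; the real DAGs define their own.
product_data = Dataset("s3://ecommerce-demo/product-data")
base = ObjectStoragePath("s3://ecommerce-demo/ingest/", conn_id="aws_default")


@dag(
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=[product_data],  # data-driven: runs when the dataset is updated
    catchup=False,
    default_args={
        # Retry transient API failures and rate limits after a delay.
        "retries": 3,
        "retry_delay": timedelta(minutes=1),
    },
)
def ingest_products():
    @task
    def list_ingestion_folders() -> list[str]:
        # Object Storage API: list folders with a pathlib-like interface.
        return [str(p) for p in base.iterdir() if p.is_dir()]

    @task
    def ingest_folder(folder: str) -> None:
        # Placeholder for extraction and Weaviate ingestion of one folder.
        print(f"Ingesting {folder}")

    # Dynamic task mapping: one mapped task instance per folder at runtime.
    ingest_folder.expand(folder=list_ingestion_folders())


ingest_products()
```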

Next Steps

Get the Astronomer GenAI cookbook to view more examples of how to use Airflow to build generative AI applications.

If you'd like to build your own hybrid search application, feel free to fork the repository and adapt it to your use case. We recommend deploying the Airflow pipelines using a free trial of Astro.
