How we optimized the Registry for performance across millions of page views
Part of the appeal of Airflow is its rich ecosystem - if you’re working with common data tools and products, chances are the provider system already offers a way to interact with them from Airflow. There are massive benefits to an ecosystem like this, but there are also natural challenges. When there are over 1,000 operators available to you, it’s difficult to know which operator to use and whether a given operator has the functionality you require. As a data engineer looking for a new operator for your use case, you’re stuck Googling and searching GitHub; and once you find one, you typically have to read its source code to understand what it does and how to configure it.
In 2021, we first released the Astronomer Registry to help solve these challenges by making Airflow providers, operators, and example DAGs easier to discover and use. If you’re curious, you can learn more about the initial release in the original blog post.
Building the Registry was no small undertaking: we had to programmatically parse Airflow operator source code across many repositories and aggregate it in a way that was consistent, searchable, and easy to use. Early on, we accepted that there were likely to be data issues across the board, given that this was the first attempt at standardizing this information, so we wanted an architecture that would let us iterate quickly and patch data issues as we noticed them. Ultimately, we landed on the following:
- A set of Airflow DAGs that look for new releases of Airflow provider packages. Once new releases are identified, the code is parsed and data is submitted to Airtable. These run on (you guessed it!) Astro.
- Three tables in Airtable: one for Airflow providers, one for modules (operators) within those providers, and one for example DAGs.
- A UI written in React & TypeScript, hosted on Netlify.
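The parsing step mentioned in the first bullet can be sketched with Python’s built-in `ast` module. This is a minimal sketch, not the Registry’s actual parser: `extract_operators`, the sample class, and the rule that a class counts as an operator when one of its base class names ends in `Operator` are all assumptions made for illustration.

```python
import ast


def extract_operators(source: str) -> list[dict]:
    """Return the name, docstring, and __init__ parameters of operator-like classes."""
    operators = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        # Assumption: a class is an operator if a base class name ends in "Operator".
        base_names = [ast.unparse(base) for base in node.bases]
        if not any(name.endswith("Operator") for name in base_names):
            continue
        params = []
        for item in node.body:
            if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                # Collect positional and keyword-only parameters, skipping self.
                all_args = item.args.args + item.args.kwonlyargs
                params = [arg.arg for arg in all_args if arg.arg != "self"]
        operators.append(
            {
                "name": node.name,
                "docstring": ast.get_docstring(node) or "",
                "params": params,
            }
        )
    return operators


# Hypothetical operator source, standing in for a file from a provider package.
SAMPLE = '''
class HelloOperator(BaseOperator):
    """Print a greeting."""

    def __init__(self, *, name, greeting="Hello", **kwargs):
        super().__init__(**kwargs)
        self.name = name
'''

print(extract_operators(SAMPLE))
```

Because the source is parsed rather than imported, a sketch like this can inspect a provider package without installing its dependencies.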
Airtable was crucial here - as we discovered data quality issues with parameter descriptions, types, and names, we could immediately fix them in Airtable’s UI while we waited for the docs fix to be merged and released in the provider package itself. This worked really well! We scaled to millions of page views with no issues and were able to react very quickly to reported issues.
The big challenge of this architecture is that it’s completely independent of everything else at Astronomer. This makes it tough to do new feature development and tough to bring the Registry closer to Astro, since there’s no traditional API and engineers need to be onboarded to an unfamiliar tech stack. The data hosted in the Registry is valuable, and tying into the authentication infrastructure we have at Astronomer would let us provide more personalization, so we decided to rebuild it.
The stack is similar, but there are a few key differences. Now, we have:
- A set of Airflow DAGs that are more fault-tolerant than the previous DAGs and provide nice interfaces for adding new providers as users request them. Still hosted on Astro!
- A Postgres database with raw data tables and a handful of materialized views (more on that later).
- An API written in Golang and a UI written in React & TypeScript.
One of the guiding principles of the new Registry was performance. Since it serves relatively static data, we wanted it to respond as quickly as possible. We designed the database with normalized tables to minimize redundant data and keep the schema easy to understand and maintain, but it wasn’t as quick as we wanted: the design called for multiple joins per read request, which slowed down our read operations. We didn’t want to change the schema, so we started looking for other options.
Materialized views are used pretty frequently in analytics applications, but they’re less common in more traditional applications. They fit the Registry well: the frequency of reads is quite high, while the frequency of writes is pretty low - writes only happen when a user wants to publish a new provider, or a provider publishes a new release. Creating materialized views for the common read operations means the joins are computed when a view is refreshed rather than on every request, so those reads are as quick as possible. Given these advantages (and the fact that we’re working with relatively small datasets), we decided materialized views were the way to go.
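To make the trade-off concrete, here is a minimal sketch of the pattern. It uses Python’s built-in `sqlite3` so that it runs anywhere, and SQLite has no materialized views, so a plain table rebuilt from the join stands in for one. In Postgres the equivalent statements would be `CREATE MATERIALIZED VIEW module_listing AS SELECT ...` and `REFRESH MATERIALIZED VIEW module_listing`. The table, column, and function names are hypothetical and are not the Registry’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized tables (hypothetical schema): modules reference their provider by id.
conn.executescript(
    """
    CREATE TABLE providers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE modules (
        id INTEGER PRIMARY KEY,
        provider_id INTEGER NOT NULL REFERENCES providers(id),
        name TEXT NOT NULL
    );
    """
)

# The join that every read request would otherwise have to run.
MODULE_LISTING_QUERY = """
SELECT m.id AS module_id, m.name AS module_name, p.name AS provider_name
FROM modules AS m
JOIN providers AS p ON p.id = m.provider_id
"""


def refresh_module_listing(conn: sqlite3.Connection) -> None:
    """Rebuild the precomputed table; stands in for REFRESH MATERIALIZED VIEW."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS module_listing")
        conn.execute("CREATE TABLE module_listing AS " + MODULE_LISTING_QUERY)


def publish_module(conn: sqlite3.Connection, provider_id: int, name: str) -> None:
    """Write path: rare, so it pays the cost of recomputing the join."""
    with conn:
        conn.execute(
            "INSERT INTO modules (provider_id, name) VALUES (?, ?)",
            (provider_id, name),
        )
    refresh_module_listing(conn)


def list_modules(conn: sqlite3.Connection) -> list[tuple]:
    """Read path: frequent, so it reads one precomputed table with no joins."""
    return conn.execute(
        "SELECT module_name, provider_name FROM module_listing ORDER BY module_id"
    ).fetchall()


conn.execute("INSERT INTO providers (id, name) VALUES (1, 'example-provider')")
publish_module(conn, 1, "ExampleOperator")
print(list_modules(conn))
```

The cost of the join moves from the read path, which runs constantly, to the write path, which runs only when something is published. Rebuilding the whole table on every write is only reasonable here because the datasets are small.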
With the new database design and a Golang-based API, the new Registry is several times faster: most pages are down from ~1 second on the old Registry to just a few hundred milliseconds on the new version!
We’re going to keep making improvements to the Registry and build new features on top of it. Stay tuned for more, and in the meantime, go check it out!