Standardizing your Astro projects with Cookiecutter and Cruft

  • Bas Harenslak

Large-scale software development is performed by multiple people and multiple teams. To enable teams to work effectively and avoid a sprawl of different conventions, tools, and technologies within your organization, you probably want to set a few standards. In this article, we demonstrate how to generate and manage multiple Astro projects so that you can onboard teams quickly while applying conventions and standards to the code in different Git repositories, using Cookiecutter and Cruft. The examples use Astro, but the ideas are applicable to any technology.

The Astronomer platform makes it incredibly easy to create and manage Astro Deployments, which are single instances of Airflow. Separate Deployments enable a true multi-tenant Airflow service. A team generally works in its own Git repository, from which it deploys to multiple Astro Deployments. Each Deployment represents a different development phase, such as development or production. In this example, one team deploys code to two Astro Deployments:

Deployment workflow for single team

With multiple teams, you then have multiple Git repositories to manage:

Multiple teams result in multiple Git repositories

While the Git repositories presumably all contain Airflow code and are thus alike, it is desirable to set best practices and standards for both existing and new teams. However, managing each repository manually would be a cumbersome task. So how do you manage multiple similar Git repositories in an automated fashion?

Setting up a template repository with Cookiecutter

Cookiecutter is a tool that enables you to define a "blueprint" for a project and generate new projects from it. Let's say you're working in a platform team and you want to deliver Astro projects with a few company-specific standards applied to them. Without Cookiecutter, you could develop one "golden standard" project, copy-paste that project for every team, and apply modifications accordingly. This is cumbersome and error-prone because you'd have to go into every project and make adjustments manually.

With Cookiecutter, you define placeholders on any level in a project. That means folder names, file names, and file content. The placeholders are then filled with the values that you supply when generating a new project.
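To make the placeholder idea concrete, here is a minimal Python sketch of the substitution step. Cookiecutter itself renders templates with Jinja2; the regex-based `render` function below is a simplified stand-in written for this illustration, and the template folder name and README line are assumptions about how such a template might look, not contents of the example repository.

```python
import re

# Matches placeholders such as {{ cookiecutter.team_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(text: str, context: dict[str, str]) -> str:
    """Replace every {{ cookiecutter.<name> }} placeholder with its value.

    Simplified stand-in for illustration only: Cookiecutter uses Jinja2,
    which supports far more than plain variable substitution.
    """
    return PLACEHOLDER.sub(lambda match: context[match.group(1)], text)


# Values that would normally be supplied when generating a project.
context = {"team_name": "analytics"}

# Placeholders can appear in folder names, file names, and file content.
# Both template strings below are hypothetical.
folder_name = render("astro_team_{{ cookiecutter.team_name }}", context)
readme_line = render("# Astro project for team {{ cookiecutter.team_name }}", context)

print(folder_name)
print(readme_line)
```

The same context is applied everywhere a placeholder occurs, which is why a single answer at generation time can name the folder and fill in file content consistently.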

To show how this works, we'll walk through an example Cookiecutter project that contains a standard Astro project with a few modifications. You can find the code for the project here. Refer to the Cookiecutter documentation for additional usage examples.

First, install Cookiecutter:

pip install cookiecutter

Let's generate a new project based on the template project by running the following command. This template requires one variable "team_name", but you can design your template using any number of variables with any name that's required for your use case:

$ cookiecutter git@github.com:astronomer/cookiecutter-astro.git
  [1/1] team_name (X): analytics

We're asked to provide a team name, let's say "analytics". This generates a project for team analytics (several files were omitted for readability):

.
└── astro_team_analytics
    ├── .github
    │   └── workflows
    │       └── astro_deploy.yaml
    ├── .pre-commit-config.yaml
    ├── Dockerfile
    ├── README.md
    ├── dags
    │   ├── .airflowignore
    │   ├── connection_variable_test.py
    │   └── hello_world.py
    ├── include
    ├── packages.txt
    ├── plugins
    ├── requirements.txt
    └── tests
        └── dags
            └── test_dag_integrity.py

This project is similar to a standard Astro project that you get by running astro dev init with the Astro CLI, but with customizations defined in the template cookiecutter-astro project. For example, there is a .pre-commit-config.yaml file for running checks before committing changes, two custom DAGs, and a folder .github which includes a GitHub Actions CI/CD pipeline. This means the analytics team now has a project ready to go and doesn't need to spend time tinkering around with CI/CD or development standards.
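The interactive prompt shown earlier can also be skipped by calling Cookiecutter from Python, which may be convenient when a platform team generates projects for several teams in one go. The sketch below uses the `cookiecutter` function from `cookiecutter.main` with `no_input` and `extra_context`; the helper names `build_arguments` and `generate_project` are made up for this illustration, and the call assumes you have SSH access to the template repository.

```python
TEMPLATE = "git@github.com:astronomer/cookiecutter-astro.git"


def build_arguments(team_name: str, output_dir: str = ".") -> dict:
    """Collect keyword arguments for cookiecutter.main.cookiecutter."""
    return {
        "template": TEMPLATE,
        "no_input": True,  # do not prompt; use extra_context and defaults
        "extra_context": {"team_name": team_name},
        "output_dir": output_dir,
    }


def generate_project(team_name: str, output_dir: str = ".") -> str:
    """Generate a project without prompting and return its path."""
    # Imported lazily so this file can be loaded without Cookiecutter installed.
    from cookiecutter.main import cookiecutter

    return cookiecutter(**build_arguments(team_name, output_dir))


# Show the arguments that would be passed for team "analytics".
print(build_arguments("analytics"))
```

Calling `generate_project("analytics")` should then produce the same `astro_team_analytics` folder as the interactive command, provided Cookiecutter is installed and the template repository is reachable.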

Cookiecutter lets platform teams stamp out projects so that new teams can onboard onto the Astronomer platform quickly:

Generating Git repositories using Cookiecutter

Maintaining generated repositories with Cruft

Generating projects from a template project gets new teams up and running quickly, which is great. However, this is where Cookiecutter's help stops. When you make an update to your Cookiecutter project, there is no functionality in Cookiecutter to synchronize that change to already generated projects. This is where Cruft comes in. Cruft enables you to synchronize changes made in a template Cookiecutter project to generated projects. Let's look at an example.

First, install Cruft:

pip install cruft

Instead of generating a project with Cookiecutter as in the previous section, now generate it using Cruft. This works similarly to Cookiecutter, but uses the command cruft create:

$ cruft create git@github.com:astronomer/cookiecutter-astro.git
  [1/1] team_name (X): analytics

This command generates a project for team "analytics" similar to the one generated with Cookiecutter in the previous section. However, there's one addition: the generated project now contains a file named .cruft.json. This file contains several important details:

{
  "template": "git@github.com:astronomer/cookiecutter-astro.git",
  "commit": "4ff6a8b495367d37d5b30673975dd03b6580606d",
  "context": {
    "cookiecutter": {
      "team_name": "analytics"
    }
  }
}

Several details were omitted for readability, but here you can see from which repository and specific commit the project was generated, and the values that were used for generating the project.
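Because .cruft.json is plain JSON, other tooling can read it too, for example to report which template commit each team's project is based on. The following sketch assumes only the three keys shown above; the function name `read_cruft_state` is hypothetical, and the temporary directory stands in for a generated project.

```python
import json
import tempfile
from pathlib import Path


def read_cruft_state(project_dir: str) -> dict:
    """Return the template URL, commit, and variables recorded in .cruft.json."""
    data = json.loads(Path(project_dir, ".cruft.json").read_text())
    return {
        "template": data["template"],
        "commit": data["commit"],
        "variables": data["context"]["cookiecutter"],
    }


# The abbreviated file content shown above.
EXAMPLE = {
    "template": "git@github.com:astronomer/cookiecutter-astro.git",
    "commit": "4ff6a8b495367d37d5b30673975dd03b6580606d",
    "context": {"cookiecutter": {"team_name": "analytics"}},
}

# Write the example to a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, ".cruft.json").write_text(json.dumps(EXAMPLE))
    state = read_cruft_state(tmp)

print(state["template"])
print(state["commit"][:7], state["variables"])
```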

Now let's say your platform team decides to make a change in the template project for all development teams, for example bumping the Astro Runtime version in the Dockerfile from 10.0.0 to 10.1.0, adding a new test in the CI/CD, or adding a new Python dependency. How do you get that change into all of your generated projects? To do this for one project using Cruft, navigate to the generated project and run cruft check:

$ cruft check
FAILURE: Project's cruft is out of date! Run `cruft update` to clean this mess up.

Cruft uses the details from .cruft.json to check in the template repository for newer commits. A failure means there was an update and the generated project is out of sync with the template project. You can update your project using cruft update:

$ cruft update
Respond with "s" to intentionally skip the update while marking your project as up-to-date or respond with "v" to view the changes that will be applied.
Apply diff and update? (y, n, s, v) [y]: v

...

-FROM quay.io/astronomer/astro-runtime:10.0.0
+FROM quay.io/astronomer/astro-runtime:10.1.0

Respond with "s" to intentionally skip the update while marking your project as up-to-date or respond with "v" to view the changes that will be applied.
Apply diff and update? (y, n, s, v) [y]: y
Good work! Project's cruft has been updated and is as clean as possible!

With this command, Cruft inspects updates in the template repository and applies them to your project, so your generated project is now in sync again with your template project! In this example, we see the Astro Runtime version was updated from 10.0.0 to 10.1.0.

While this is definitely better than copying code changes manually from a template project, it's still cumbersome having to run cruft update for every repository. In some organizations, there are hundreds of teams and having to repeat cruft update hundreds of times is still a lot of work, so let's automate that.
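As a first step in that direction, a platform engineer with local checkouts of several generated repositories could at least find out which ones are behind. The sketch below is one possible approach, not part of Cruft: it runs `cruft check` in every sub-directory that contains a .cruft.json file and treats a non-zero exit status as "out of date". The function names and the idea of a shared parent directory are assumptions made for this illustration.

```python
import subprocess
from pathlib import Path


def is_out_of_date(project_dir: Path, check_command=("cruft", "check")) -> bool:
    """Return True when the check command exits with a non-zero status.

    Note: a non-zero status can also mean the check itself failed,
    for example because the template repository was unreachable.
    """
    result = subprocess.run(
        list(check_command), cwd=project_dir, capture_output=True, text=True
    )
    return result.returncode != 0


def find_outdated(parent_dir: str, check_command=("cruft", "check")) -> list[str]:
    """List sub-directories that contain .cruft.json and fail the check."""
    outdated = []
    for project in sorted(Path(parent_dir).iterdir()):
        has_cruft_file = (project / ".cruft.json").is_file()
        if has_cruft_file and is_out_of_date(project, check_command):
            outdated.append(project.name)
    return outdated
```

Calling `find_outdated("path/to/checkouts")` returns the names of the projects that need a `cruft update`. The `check_command` parameter exists so the logic can be exercised with a stand-in command when Cruft is not installed.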

Automatically applying template repository updates

The approach we'll take for automation is to create a pull request on the generated repository with changes from the template repository. Since changes sometimes require human checks to ensure everything is still working properly, and to ensure code changes are always reviewed, we opt for suggesting changes via a pull request instead of forcing changes by pushing directly to a repository. You're of course free to build this differently according to your preferred way of working.

For this example, we'll use GitHub Actions, but any CI/CD system will suffice. The following GitHub Actions YAML file should run in every generated project, so store this file in the template project to automatically provision it for every generated project. It runs once an hour and checks for updates in the template repository. We'll break it down step-by-step.

name: Check for updates in template repository
permissions:
  contents: write
  pull-requests: write
on:
  schedule:
    - cron: "0 * * * *" # Once an hour, at minute 0
jobs:
  check-for-updates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Cruft
        run: pip3 install cruft

      # Add SSH key to SSH agent for reading from Cookiecutter repo
      - name: Set SSH key
        uses: webfactory/ssh-agent@v0.8.0
        with:
          ssh-private-key: ${{ secrets.COOKIECUTTER_REPO_KEY }}

      - name: Check if update is available
        continue-on-error: false
        id: check
        run: |
          CHANGES=0
          if [ -f .cruft.json ]; then
            if ! cruft check; then
              CHANGES=1
            fi
          else
            echo "No .cruft.json file"
          fi

          echo "has_changes=$CHANGES" >> "$GITHUB_OUTPUT"

      - name: Run update if available
        if: steps.check.outputs.has_changes == '1'
        run: |
          git config --global user.email "you@example.com"
          git config --global user.name "GitHub"

          cruft update --skip-apply-ask --refresh-private-variables
          git restore --staged .

      - name: Create pull request
        if: steps.check.outputs.has_changes == '1'
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          add-paths: .
          commit-message: "Merge updates from template repository"
          branch: cruft/update
          delete-branch: true
          title: 👨‍🔧 Merge updates from template repository
          body: |
            This repository must be kept in sync with the template repository.

            This is an autogenerated PR. [Cruft](https://cruft.github.io/cruft/) has detected updates in the template repository.

To run this GitHub Actions workflow, you first need to grant a workflow running in the generated repository access to clone the template repository by setting up a deploy key. In the example, the secret used to read the template repository is named COOKIECUTTER_REPO_KEY. With that set up, the workflow runs once an hour and checks for updates using the following steps:

  • In the step "Check if update is available", it runs cruft check.
  • If an update is detected, in the step "Run update if available", it runs cruft update --skip-apply-ask --refresh-private-variables to update the code on the GitHub runner.
  • If an update is detected, in the step "Create pull request" it creates a pull request.
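The first step communicates its result to the later steps through a step output: it appends a `has_changes=...` line to the file named by the `GITHUB_OUTPUT` environment variable, which the `if:` conditions then read. For readers more comfortable with Python than shell, the same logic could be sketched as follows. This is an illustrative equivalent of the shell step, not a replacement shipped with the template, and the function names are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def has_changes(project_dir: str = ".", check_command=("cruft", "check")) -> int:
    """Mirror the shell step: 1 if .cruft.json exists and the check fails, else 0."""
    if not Path(project_dir, ".cruft.json").is_file():
        print("No .cruft.json file")
        return 0
    result = subprocess.run(list(check_command), cwd=project_dir)
    return 1 if result.returncode != 0 else 0


def write_step_output(name: str, value: int, output_file: str) -> None:
    """Append name=value to the file GitHub Actions reads step outputs from."""
    with open(output_file, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")


# GITHUB_OUTPUT is only set on a GitHub Actions runner, so this is a no-op elsewhere.
if "GITHUB_OUTPUT" in os.environ:
    write_step_output("has_changes", has_changes(), os.environ["GITHUB_OUTPUT"])
```

Keeping the check and the update in separate steps means the update and pull request steps are skipped entirely when the project is already in sync.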

With this CI/CD set up, the pipeline runs once an hour and automatically creates a pull request containing updates from the template repository, if there are any. For example:

Generating Git repositories using Cookiecutter

The responsible team can then decide to accept the change, or whether further updates specific to their repository are required. This completes the architecture diagram:

Synchronizing template repository changes using Cruft

Conclusion

At Astronomer, we see different variations of this setup with our customers. A platform team could be given more authority by pushing changes from the template repository directly to the generated repositories, without a pull request. However, this requires the platform team to have a good understanding of each development team's repository to avoid breaking things, which doesn't scale well to hundreds of development teams.

On the other hand, providing a free-for-all unrestricted development environment without any standards gives the most freedom to development teams to architect their project in whatever way they prefer. While this is desirable within some organizations, at a certain scale we often see the introduction of certain development standards.

The approach in this blog post doesn't force anything, but provides platform teams with a mechanism for introducing changes to a blueprint (the Cookiecutter template repository), which propagate to the development teams in the form of pull requests. The development teams are then given the choice to accept, change, or deny the pull request.

Cookiecutter is a great tool for creating a template of a project and generating new projects from that. However, Cookiecutter doesn't have a way to synchronize changes in the template project to the generated projects. That's where Cruft comes in. Cookiecutter and Cruft together form a powerful mechanism for organizations to set and maintain development standards. This blog post demonstrates a happy flow of Cookiecutter and Cruft. In practice, you will run into challenges such as linking existing projects to the template project, or merge conflicts when the same line of code is changed in both repositories. Cruft offers various commands for dealing with these situations, so we suggest reading the documentation for both tools.

And for reference, the example Astro Cookiecutter project: https://github.com/astronomer/cookiecutter-astro.
