dbt is a great tool for data transformations. The paid version, dbt Cloud, offers Slim CI, which makes it easy to test new and modified dbt models before deploying them to production. You can configure jobs to run when a pull request is opened in your repository. dbt Cloud determines the set of new and modified resources by comparing against the artifacts of the job's most recent successful run, so you can choose to run only the dbt models whose code has changed. This makes your CI process more efficient in terms of time and cost. In this article I will explain how to mimic this functionality when you are limited to dbt Core.
Summarized, the process is as follows:
Run dbt when deploying to production and save the manifest.json file to cloud storage. This file contains the state of the models from the last dbt run, and thus reflects the models deployed to production.
Retrieve the manifest.json file from cloud storage when opening a new pull request, and run dbt with the state:modified+ parameter in that same pull request.
Slim CI workflow using dbt Core and Google Cloud Platform
Storing the manifest.json file
First, you need to create a cloud storage folder, which you can do through the UI, but preferably through Terraform. I am using Google Cloud Storage (GCS), but you can use any cloud storage. The manifest.json file will be copied to this folder and overwritten each time a new deployment is made. I named my GCS bucket manifest-location-prod, and can access it at gs://manifest-location-prod when logging in to that Google Cloud project during the CI process. I was already generating the manifest.json file in a workflow where we publish our dbt documentation to GitHub Pages, so I incorporated the copying of the manifest.json file to the GCS bucket into that workflow. Find a full example of this GitHub workflow in this repository, or see the relevant snippet below.
Note: you first need to run dbt before generating the documentation, because this ensures that documentation is generated for all columns, and not just the columns described in your dbt yml files.
Part of the GitHub workflow that generates the dbt docs and stores them in Google Cloud Storage
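The relevant steps can be sketched as the YAML below. This is a hedged sketch, not the exact workflow from the repository: the step names are illustrative, and it assumes GCP authentication (for example via google-github-actions/auth) has been set up earlier in the job. The bucket name manifest-location-prod comes from the article.

```yaml
# Illustrative GitHub Actions steps: run dbt, generate docs,
# then copy the resulting manifest.json to the GCS bucket.
steps:
  - name: Run dbt against production
    run: dbt run --target prod

  - name: Generate dbt docs
    run: dbt docs generate --target prod

  - name: Copy manifest.json to GCS
    run: gsutil cp target/manifest.json gs://manifest-location-prod/manifest.json
```

Running dbt before dbt docs generate matters here, as noted above: it ensures the generated catalog covers all columns, not just those documented in yml files.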
When a pull request is opened, we use that same manifest.json file to get the current state of the deployed models. After retrieving the file from GCS, the new or modified dbt models are run and tested. The state:modified+ parameter selects these models plus everything downstream of them.
The workflow that runs and tests dbt models when opening a Pull Request
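A minimal sketch of that pull request workflow is shown below, assuming the same GCS bucket and a pre-configured GCP authentication step; the step names and the prod-state directory are illustrative. The production manifest is downloaded and passed to dbt via --state, so state:modified+ can compare the branch against what is deployed.

```yaml
# Illustrative pull request workflow steps: fetch the production
# manifest, then run and test only new/modified models and their children.
steps:
  - name: Download production manifest
    run: |
      mkdir -p prod-state
      gsutil cp gs://manifest-location-prod/manifest.json prod-state/

  - name: Run new and modified models
    run: dbt run --select state:modified+ --state prod-state --target dev

  - name: Test new and modified models
    run: dbt test --select state:modified+ --state prod-state --target dev
```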
Limit the data
Testing these new models against full production datasets during pull requests would be expensive. Limit the amount of data against which the models are tested by using dbt macros. This allows you to test models on a subset of the production data. The macro in days_back.sql takes a date column as input. The objective is to limit the data to the past x days, counting back from today.
A macro that filters data on the past x days
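The macro could look like the sketch below. It is an assumption of what days_back.sql contains, reconstructed from the description above and written for BigQuery, which the current_date timezone argument implies.

```sql
-- days_back.sql: filter a date column to the past `days` days.
-- Sketch based on the article's description; BigQuery SQL assumed.
{% macro days_back(column, days) %}
    where date({{ column }}) >= date_sub(current_date("Europe/Amsterdam"), interval {{ days }} day)
{% endmacro %}
```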
In our staging model stg_web.sql, we call the macro specifying that we want data from the last 3 days, along with the column from which to extract the dates. With 3 as the input for days, date_sub(current_date("Europe/Amsterdam"), interval 3 day) in days_back.sql returns the date from 3 days ago. The macro then filters on the dates that are greater than or equal to the returned date. Do note that your source table has to be partitioned on the date column to optimize the query for cost.
Staging model that implements the days back macro
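A sketch of such a staging model is shown below. The source and column names are illustrative, not the article's actual ones; the target check reflects the conditionality described next, so the filter is skipped when targeting prod.

```sql
-- stg_web.sql: apply the days_back filter only outside production.
-- `source('web', 'events')` and `event_date` are hypothetical names.
select
    *
from {{ source('web', 'events') }}
{% if target.name != 'prod' %}
    {{ days_back('event_date', 3) }}
{% endif %}
```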
All of this is conditional on the name of the target, because we do not want to filter the data when targeting “prod”. The pull request workflow targets “dev”, which means the days_back macro is only applied when integrating new dbt models.
Testing and running new or modified dbt models is important, but the costs and time that it takes can get out of hand very quickly. By mimicking the Slim CI feature of dbt Cloud and filtering the data against which the dbt models are run, we can make our CI process more (cost) efficient.
Thanks for reading! Feel free to reach out if you found this post interesting or if it helped you out in any way.