This directory stores the configuration for building our data catalog using dbt.
- 🖼️ Background: What does the data catalog do?
- 💻 How to develop the catalog
- ➕ How to add a new model
- 🔨 How to rebuild models using GitHub Actions
- 🧪 How to add and run tests and QC reports
- 🐛 Debugging tips
- 📖 Data documentation
- 📝 Design doc for our decision to develop our catalog with dbt
- 🧪 Generic tests we use for testing
The data catalog accomplishes a few main goals:
The models defined in models/
represent the SQL operations that
we run in order to transform raw data into output data for use in statistical
modeling and reporting. These transformations comprise a DAG that we use for
documenting our data and rebuilding it efficiently. We use the dbt run
command to build these
models into views and tables in AWS Athena.
Note
When we talk about "models" in these docs, we generally mean the resources that dbt calls "models", namely the definitions of the tables and views in our Athena warehouse. In contrast, we will use the phrase "statistical models" wherever we mean to discuss the algorithms that we use to predict property values.
The tests defined in the schema.yml
files in the models/
directory
set the specification for our source data and its transformations. These
specs allow us to build confidence in the
integrity of the data that we use and publish. We use the
dbt test
command to run these
tests against our Athena warehouse.
The DAG definition is parsed by dbt and used to build our data
documentation. We use the
dbt docs generate
command to generate these docs.
The workflows, actions, and scripts defined in the .github/
directory work together to perform all of our dbt operations
automatically, and to integrate with the development cycle such that new
commits to the main branch of this repository automatically deploy changes to
our tables, views, tests, and docs. Automated tasks include:
- (Re)building any models that have been added or modified since the last commit to the main branch (the build-and-test-dbt workflow)
- Running tests for any models that have been added or modified since the last commit, including tests that have themselves been added or modified (the build-and-test-dbt workflow)
- Running tests for all models once per day (the test-dbt-models workflow)
- Checking the freshness of our source data once per day (the test-dbt-source-freshness workflow)
- Generating and deploying data documentation to our docs site on every commit to the main branch (the deploy-dbt-docs workflow)
- Cleaning up temporary resources in our Athena warehouse whenever a pull request is merged into the main branch (the cleanup-dbt-resources workflow)
These instructions are for Ubuntu, which is the only platform we've tested.
For background, see the docs on installing dbt with pip. (You don't need to follow these docs in order to install dbt; the steps that follow will take care of that for you.)
- Python3 with venv installed (sudo apt install python3-venv)
- AWS CLI installed locally
  - You'll also need permissions for Athena, Glue, and S3
- aws-mfa installed locally
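If you're starting from a clean Ubuntu machine, the following is one way to satisfy these prerequisites; the AWS CLI and aws-mfa install methods shown here are just suggestions, so use whatever install method your team prefers:

```bash
# venv support for Python 3 (from the prerequisite list above)
sudo apt install python3-venv

# One way to install the AWS CLI on Ubuntu; the official installer from AWS
# is another common option
sudo apt install awscli

# aws-mfa is distributed on PyPI, so a user-level pip install is one option
pip install --user aws-mfa
```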
Run the following commands in this directory:
python3 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt
dbt deps
To run dbt commands, make sure you have the virtual environment activated:
source venv/bin/activate
You must also authenticate with AWS using MFA if you haven't already today:
aws-mfa
We use the dbt build
command to build tables and
views (called models and
seeds in dbt jargon) in our
Athena data warehouse. See the following sections for specific instructions
on how to build development and
production models.
When passed no arguments, dbt build
will default to building
all tables and views in development schemas dedicated to your user. The full
build takes about three hours and thirty minutes, so we don't recommend running
it from scratch.
Instead, start by copying the production dbt state file (also known as the manifest file):
aws s3 cp s3://ccao-dbt-cache-us-east-1/master-cache/manifest.json master-cache/manifest.json
Then, use dbt clone
to
clone the production tables and views into your development environment:
dbt clone --state master-cache
This will copy all production views and tables into a new set of Athena schemas
prefixed with your Unix $USER
name (e.g. z_dev_jecochr_default
for the
default
schema when dbt
is run on Jean's machine).
Once you've copied prod tables and views into your development schemas, you can rebuild specific tables and views using dbt's node selection syntax.
Use --select
to build one specific table/view, or a group of tables/views.
Here are some example commands that use --select
to build a subset of all
tables/views:
# This builds just the vw_pin_universe view
dbt build --select default.vw_pin_universe --resource-types model
# This builds vw_pin_universe as well as vw_pin10_location
dbt build --select default.vw_pin_universe location.vw_pin10_location --resource-types model
# This builds all models and seeds in the default schema
dbt build --select default.* --resource-types model seed
Note
If you are building a Python model,
your model may require external dependencies be available on S3.
To make these dependencies available to your model, run the
build-and-test-dbt
workflow on your branch to deploy any Python dependencies
that you've added to the config.packages
attribute
on your model.
By default, all dbt
commands will run against the dev
environment (called
a target in
dbt jargon), which namespaces the resources it creates by prefixing database
names with z_dev_
and your Unix $USER
name.
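For example, the same build command runs against dev by default and only touches production when --target prod is passed explicitly:

```bash
# Builds into your z_dev_$USER_* development schemas (the default target)
dbt build --select default.vw_pin_universe --resource-types model

# Builds into production -- you should almost never need to do this manually
dbt build --select default.vw_pin_universe --resource-types model --target prod
```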
You should almost never have to manually build tables and views in our production environment, since this repository is configured to automatically deploy production models using GitHub Actions for continuous integration. However, in the rare case that you need to manually build models in production, see 🔨 How to rebuild models using GitHub Actions.
If you'd like to remove your development resources to keep our Athena data sources tidy, you have two options: Delete all of your development Athena databases, or delete a selection of Athena databases.
To delete all the resources in your local environment (i.e. every Athena
database with a name matching the pattern z_dev_$USER_$SCHEMA
):
../.github/scripts/cleanup_dbt_resources.sh dev
To instead delete a selected database, use the aws glue delete-database
command:
aws glue delete-database --name z_dev_jecochr_default
Note that these two operations will only delete Athena databases, and will leave
intact any parquet files that your queries created in S3. If you would like to
remove those files as well, delete them in the S3 console or using the aws s3 rm
command.
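For reference, here is a hedged sketch of the CLI approach; the bucket and prefix below are placeholders, so double-check where your Athena queries actually write their results before deleting anything:

```bash
# Placeholder path -- confirm the real bucket/prefix first
aws s3 ls s3://<your-athena-results-bucket>/<your-dev-prefix>/ --recursive

# Once you're sure, remove the files
aws s3 rm s3://<your-athena-results-bucket>/<your-dev-prefix>/ --recursive
```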
Run tests for all models:
dbt test
Run all tests for one model:
dbt test --select default.vw_pin_universe
Run only one test:
dbt test --select default_vw_pin_universe_unique_by_14_digit_pin_and_year
Run a test against the prod models:
dbt test --select default_vw_pin_universe_unique_by_14_digit_pin_and_year --target prod
Run tests for dbt macros:
dbt run-operation test_all
Note that we configure dbt's asset-paths
attribute in
order to link to images in our documentation. Some of those images, like the
Mermaid diagram defined in assets/dataflow-diagram.md
, are generated
automatically during the deploy-dbt-docs
deployment workflow. To generate
them locally, make sure you have
mermaid-cli
installed (we
recommend a local
installation) and
run the following command:
for file in assets/*.mmd; do
./node_modules/.bin/mmdc -i "$file" -o "${file/.mmd/.svg}"
done
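If you don't have mermaid-cli installed yet, a typical local (project-level) install looks like this, assuming Node.js and npm are available:

```bash
npm install @mermaid-js/mermaid-cli
```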
Then, generate the documentation:
dbt docs generate
This will create a set of static files in the target/
subdirectory that can
be used to serve the docs site.
To serve the docs locally:
dbt docs serve
Then, navigate to http://localhost:8080 to view the site.
To request the addition of a new model, open an issue using the Add a new dbt model issue template. The assignee should follow the checklist in the body of the issue in order to add the model to the DAG.
There are a few subtleties to consider when requesting a new model, outlined below.
We default to SQL models, since they are simple and well-supported, but in some cases we make use of Python models instead. Prefer a Python model if all of the following conditions are true:
- The model requires complex transformations that are simpler to express using pandas than using SQL
- The model only depends on (i.e. joins to) other models materialized as tables, and does not depend on any models materialized as views
- The model's pandas code only imports third-party packages that are either
preinstalled in the Athena PySpark
environment
or that are pure Python (i.e. that do not include any C extensions or code in
other languages)
- The most common packages that we need that are not pure Python are
geospatial analysis packages like
geopandas
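For orientation, a dbt Python model is a file that defines a model() function and returns a DataFrame. The sketch below is purely illustrative (the model, columns, and session argument name are hypothetical), but it shows the general shape:

```python
# Hypothetical sketch of a dbt Python model -- names are illustrative only
def model(dbt, spark_session):
    # Python models are materialized as tables, not views
    dbt.config(materialized="table")

    # dbt.ref() returns a Spark DataFrame for the upstream model
    upstream = dbt.ref("some_upstream_table")

    # Convert to pandas for transformations that are awkward in SQL
    df = upstream.toPandas()
    df["is_flagged"] = df["some_column"].notnull()

    # The returned DataFrame becomes the contents of the new table
    return df
```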
If your Python model needs to use a third-party pure Python package that is not preinstalled in the Athena PySpark environment, you can configure the dependency to be automatically deployed to our S3 bucket that stores PySpark dependencies as part of the dbt build workflow on GitHub Actions. Follow these steps to include your dependency:
- Update the config.packages array on your model definition in your model's schema.yml file to add elements for each of the packages you want to install
  - Make sure to provide a specific version for each package so that our builds are deterministic
  - Unlike a typical pip install call, the dependency resolver will not automatically install your dependency's dependencies, so check the dependency's documentation to see if you need to manually specify any other dependencies in order for your dependency to work
```yaml
# Example -- replace `model.name` with your model name, `dependency_name` with
# your dependency name, and `X.Y.Z` with the version of the dependency you want
# to install
models:
  - name: database_name.table_name
    config:
      packages:
        - "dependency_name==X.Y.Z"
```
- Add an
sc.addPyFile
call to the top of the Python code that represents your model's query definition so that PySpark will make the dependency available in the context of your code
# Example -- replace `dependency_name` with your dependency name and `X.Y.Z`
# with the version of the dependency you want to import
# type: ignore
sc.addPyFile( # noqa: F821
"s3://ccao-athena-dependencies-us-east-1/dependency_name==X.Y.Z.zip"
)
- Call
import dependency_name
as normal in your script to make use of the dependency
# Example -- replace `dependency_name` with your dependency name
import dependency_name
See the reporting.ratio_stats
model for an example of this type of
configuration.
There are a number of different ways of materializing tables in Athena using dbt; see the dbt docs for more detail.
So far our DAG only uses view and table materialization, although we are interested in eventually incorporating incremental materialization as well.
The choice between view and table materialization depends on the runtime and downstream consumption of the model query. There is no hard and fast rule, but as a general guideline, consider materializing a model as a table if queries using the table take longer than 30s to execute, or if the model is consumed by another model that is itself computationally intensive.
Our DAG is configured to materialize models as views by default, so extra configuration is only required for non-view materialization.
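Concretely, switching a model from the default view materialization to a table only requires a materialized setting in its config. One place that setting can live is the model's schema.yml entry (it can equally go in the model's config block or in dbt_project.yml):

```yaml
models:
  - name: location.tax
    config:
      materialized: table
```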
Models should be namespaced according to the database that the model lives in
(e.g. location.tax
for the tax
table in the location
database). Since
dbt does not yet support namespacing for refs, we include the database as a
prefix in the model name to simulate real namespacing, and we override the
generate_alias_name
macro to strip out this fake namespace when generating
names of tables and views in Athena
(docs).
In addition to database namespacing, views should be named with a vw_
prefix
(e.g. location.vw_pin10_location
) to mark them as a view, while tables do not
require any prefix (e.g. location.tax
).
Finally, for the sake of consistency and ease of interpretation, all tables and views should be named in the singular form (e.g. location.tax rather than location.taxes).
Models are generally defined in the schema.yml
file within each database
subdirectory. Resources related to each model should be defined inline (with
the exception of columns; see Column descriptions).
For complicated models with many columns or tests, we split schema.yml
files into individual files per model. These files should be contained in a
schema/
subdirectory within each database directory, and should be named
using the fully namespaced model name. For example, the model definition
for iasworld.sales
lives in models/iasworld/schema/iasworld.sales.yml
.
All new models should include, at minimum, a
description
of the model itself. We store these model-level descriptions as docs
blocks
within the docs.md
file of each schema subdirectory.
Descriptions related to models in a schema/
subdirectory should still live
in docs.md
. For example, the description for default.vw_pin_universe
lives
in models/default/docs.md
.
New models should also include descriptions for each column. Since the first few characters of a column description will be shown in the documentation in a dedicated column on the "Columns" table, column descriptions should always start with a sentence that is short and simple. This allows docs readers to scan the "Columns" table and understand what the column represents at a high level.
Column descriptions can live in three separate places with the following hierarchy:
models/shared_columns.md
- Definitions shared across all databases and modelsmodels/$DATABASE/columns.md
- Definitions shared across a single databasemodels/$DATABASE/schema.yml
ORmodels/$DATABASE/schema/$DATABASE-$MODEL.yml
- Definitions specific to a single model
We use the following pattern to determine where to define each column description:
- If a description is shared by three or more resources across multiple databases, its text should be defined as a docs block in models/shared_columns.md. The docs block identifier for each column should have a shared_column_ prefix.
- If a description is shared by three or more resources across multiple models in the same database, its text should be defined as a docs block in models/$DATABASE/columns.md. The docs block identifier for each column should have a column_ prefix.
- If a description is shared between two or fewer columns, its text should be defined inline in the schema.yml file under the description key for the column.
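As an illustration of the shared cases, a description is written once as a docs block and then referenced by each column that uses it (the column name below is hypothetical):

```yaml
# In models/$DATABASE/columns.md (note the column_ prefix on the identifier):
#
#   {% docs column_township_code %}
#   Short first sentence describing the column. Longer details can follow.
#   {% enddocs %}
#
# Then, in models/$DATABASE/schema.yml:
models:
  - name: database_name.table_name
    columns:
      - name: township_code
        description: '{{ doc("column_township_code") }}'
```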
New models should generally be added with accompanying tests to ensure the underlying data and transformations are correct. For more information on testing, see 🧪 How to add and run tests and QC reports.
GitHub Actions can be used to manually rebuild part or all of our dbt DAG. To use this functionality:
- Go to the build-and-test-dbt workflow page
- Click the Run workflow dropdown on the right-hand side of the screen
- Populate the input box following the instructions below
- Click Run workflow, then click the created workflow run to view progress
The workflow input box expects a space-separated list of dbt model names or selectors.
Multiple models can be passed at the same time, as the input box values are
passed directly to dbt build
. Model names must include the database schema name. Some possible inputs include:
- default.vw_pin_sale - Rebuild a single view
- default.vw_pin_sale default.vw_pin_universe - Rebuild two views at once
- +default.vw_pin_history - Rebuild a view and all its upstream dependencies
- location.* - Rebuild all views under the location schema
- path:models - Rebuild the full DAG (:warning: takes a long time!)
For more possible inputs using dbt node selection, see the documentation site.
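If you prefer the command line over the GitHub UI, the GitHub CLI can dispatch the same workflow. Note that the input key below (models) is an assumption; check the workflow file for the actual name of its input:

```bash
# Hypothetical input name -- confirm it against the workflow definition
gh workflow run build-and-test-dbt \
  --ref "$(git branch --show-current)" \
  -f models="default.vw_pin_sale default.vw_pin_universe"
```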
We test the integrity of our raw data and our transformations using a few different types of tests and reports, described below.
There are three types of products that we use to check the integrity of our data and the transformations we apply on top of that data:
- Data tests check that hard-and-fast assumptions about our raw data are correct. These tests correspond to dbt data tests.
  - For example: Test that a table is unique by parid and taxyr.
- Unit tests check that transformation logic inside a model definition produces the correct output on a specific set of input data. These tests correspond to dbt unit tests.
  - For example: Test that an enum column computed by a CASE... WHEN expression in a view produces the correct output for a given input string.
- QC reports check for suspicious cases that might indicate a problem with our data, but that can't be confirmed automatically. We implement these reports using dbt models.
  - For example: Query for all parcels whose market value increased by more than $500k in the last year.
The following sections describe how to add and run each of these types of products.
We implement data tests using dbt tests
to check that hard-and-fast assumptions about our raw data are correct. We prefer adding tests
inline in schema.yml
config files using generic
tests,
rather than singular
tests.
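For instance, the parid/taxyr uniqueness check described above can be expressed inline roughly like this, assuming a generic such as dbt_utils.unique_combination_of_columns is available via our installed packages (our own generics in tests/generic/ are attached the same way):

```yaml
models:
  - name: iasworld.pardat
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - parid
            - taxyr
```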
Currently, our primary use of data tests is to check assumptions about iasWorld data. We refer to this set of tests as "iasWorld data tests", and we've built a system for running and interpreting them that we will explain in the sections to follow. Other types of data tests do exist, and we primarily run them via automated GitHub workflows during CI when models change. However, we anticipate that in the future we will likely build out similar infrastructure for running and interpreting non-iasWorld data tests to accompany the infrastructure we have built for iasWorld data tests.
The iasWorld data test suite can be run using the dbt test
command with a dedicated
selector
and the --store-failures
flag,
and its output can be transformed for review and analysis using the
transform_dbt_test_results
script.
This script reads the metadata for the most recent dbt test
run and outputs a number of
different artifacts with information about the tests:
- An Excel workbook with detailed information on each failure to aid in resolving data problems
- Parquet files representing metadata tables that can be uploaded to S3 for aggregate analysis
There are two instances when iasWorld data tests typically run:
- Once per day by the test-dbt-models GitHub workflow, which pushes Parquet output to S3 in order to support our analysis of test failures over time
- On demand by a Data team member whenever a Valuations staff member requests a copy of the Excel workbook for a township, usually right before the town closes
Since the first instance is a scheduled job that requires no intervention, the following steps describe how to respond to a request from Valuations staff for a fresh copy of the test failure output before town closing.
Typically, Valuations staff will ask for test output for a specific township. We'll refer to the
township code for this township
using the bash variable $TOWNSHIP_CODE
.
First, run the tests locally using dbt and the iasWorld data test selector:
# Make sure you're in the dbt subdirectory with the virtualenv activated
cd dbt
source venv/bin/activate
# Run the tests and store failures in Athena
dbt test --selector qc_tests --store-failures
Next, transform the results for the township that Valuations staff requested:
python3 scripts/transform_dbt_test_results.py --township $TOWNSHIP_CODE
Finally, spot check the Excel workbook that the script produced to make sure it's formatted correctly, and send it to Valuations staff for review.
There are a few specific modifications a test author needs to make to ensure that a new iasWorld data test can be run by the workflow and interpreted by the script:
- One of either the test or the model that the test is defined on must be tagged with the tag test_qc_iasworld
  - Prefer tagging the model, and fall back to tagging the test if for some reason the model cannot be tagged (e.g. if it has some non-QC tests defined on it)
  - If you would like to disable a data test but you don't want to remove it altogether, you can tag it or its model with test_qc_exclude_from_workbook, which will prevent the test (or all of the model's tests, if you tagged the model) from running as part of the qc_tests selector
- The test definition must supply a few specific parameters:
  - name must be set and follow the pattern iasworld_<table_name>_<test_description>
  - additional_select_columns must be set to an array of strings representing any extra columns that need to be output by the test for display in the workbook
    - Generics typically select any columns mentioned by other parameters, but if you are unsure which columns will be selected by default (meaning they do not need to be included in additional_select_columns), consult our documentation for the generic test you're using
  - config.where should typically be set to provide a filter expression that restricts tests to unique rows and to rows matching a date range set by the test_qc_year_start and test_qc_year_end project variables
  - meta should be set with a few specific string attributes:
    - description (required): A short human-readable description of the test
    - category (optional): A workbook category for the test, required if a category is not defined for the test's generic in the TEST_CATEGORIES constant in the transform_dbt_test_results script
    - table_name (optional): The name of the table to report in the output workbook, if the workbook should report a different table name than the name of the model that the test is defined on
See the iasworld_pardat_class_in_ccao_class_dict
test
for an example of a test that sets these attributes.
Due to the similarity of parameters defined on iasWorld data tests, we make extensive use of YAML anchors and aliases to define symbols for commonly-used values. See here for a brief explanation of the YAML anchor and alias syntax.
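Putting those parameters and the anchor/alias pattern together, a hypothetical test definition might look roughly like the sketch below; the generic names and filter are illustrative only, so see the real test linked above for an authoritative example:

```yaml
models:
  - name: iasworld.pardat
    tests:
      # Hypothetical generic name; defines anchors for reuse
      - some_generic_test:
          name: iasworld_pardat_some_check
          additional_select_columns: &select_columns
            - parid
            - taxyr
          config: &test_config
            where: |
              taxyr BETWEEN '{{ var("test_qc_year_start") }}'
                AND '{{ var("test_qc_year_end") }}'
          meta:
            description: parid and taxyr pass some check
      # A second test reuses the same values via aliases
      - another_generic_test:
          name: iasworld_pardat_another_check
          additional_select_columns: *select_columns
          config: *test_config
          meta:
            description: parid and taxyr pass another check
```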
Writing a data test in a schema.yml
file requires a generic
test
to define the underlying test logic. Our generic tests are defined
in the tests/generic/
directory. Before writing a test, look at
the documentation for our generics to see if
any of them meet your needs.
If a generic test does not meet your needs but seems like it could be easily extended to meet your needs (say, if it inner joins two tables but you would like to be able to configure it to left join those tables instead) you can modify the macro that defines the generic test as part of your PR to make the change that you need.
If no generic tests meet your needs and none can be easily modified to do so, you have two options:
- Define a new model in the models/qc/ directory that can use a pre-existing generic. This is a good option if, say, you need to join two or more tables in a complex way that is specific to your test and not easily generalizable. With this approach, you can perform that join in the model, and then the generic test doesn't need to know anything about it.
- Write a new generic test (a bare-bones sketch follows this list). If you decide to take this approach, make sure to read the docs on writing custom generic tests. This is a good option if you think that the logic you need for your test will be easily generalizable to other models and other tests. You'll also need to follow a few extra steps that are specific to our environment:
  - Add a default category for your generic test in the TEST_CATEGORIES constant in the transform_dbt_test_results script
  - Make sure that your generic test supports the additional_select_columns parameter that most of our generic tests support, making use of the format_additional_select_columns macro to format the parameter when applying it to your SELECT condition
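A generic test is just a macro that returns the rows violating an assumption. Here is a bare-bones sketch with a hypothetical test name and logic; the exact signature of format_additional_select_columns is an assumption, so check the macro before copying this:

```sql
-- tests/generic/test_value_is_positive.sql (hypothetical example)
{% test value_is_positive(model, column_name, additional_select_columns=[]) %}

select
    {{ column_name }}
    {%- if additional_select_columns %},
    {{ format_additional_select_columns(additional_select_columns) }}
    {%- endif %}
from {{ model }}
-- Returning rows means failure, so select the rows that violate the assumption
where {{ column_name }} <= 0

{% endtest %}
```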
Unit tests help ensure that the transformations we apply on top of our raw data do not introduce errors. Unit testing is available in dbt as of the 1.8 release, but there is a bug that prevents it from working with the schema alias system that we use to namespace our models, so we do not yet have a process for adding or running unit tests. Jean is leading the effort to contribute to dbt Core in order to support unit tests in projects that follow our schema alias system, so she will update this section with documentation once that effort is resolved.
QC reports help us investigate suspicious data that might indicate a problem, but
that can't be confirmed automatically. We implement QC reports using dedicated
dbt models that are configured with attributes that can be parsed by the
export_models
script.
We run QC reports when Valuations staff ask for them, which most often occurs before a major event in the Valuations calendar like the close of a township.
The export_models
script
exposes a few options that help to export the right data:
- --select: This option controls which models the script will export. This option is equivalent to the dbt --select option, and any valid dbt --select expression will work for this option.
- --where: This option controls which rows the script will return for the selected model in a similar fashion as a SQL WHERE clause. Any expression that could follow a WHERE keyword in a SQL filter condition will work for this option.
- --rebuild or --no-rebuild: This flag determines whether or not the script will rebuild the selected models using dbt run prior to export. It defaults to false (--no-rebuild) and is most useful in rare cases where the underlying models that comprise the reports have been edited since the last run, typically during the period when a QC report is under active development.
We tag the models that comprise our town close QC reports using the qc_report_town_close
tag, and we filter them for a specific township code (like "70") and tax year during export.
Here's an example of how to export those models for a township code defined by $TOWNSHIP_CODE
and a tax year defined by $TAXYR
:
python3 scripts/export_models.py --select tag:qc_report_town_close --where "taxyr = '$TAXYR' and township_code = '$TOWNSHIP_CODE'"
The script will output the reports to the dbt/export/output/
directory, and will print the
names of the reports that it exports during execution.
We define the AHSAP change in value QC report using one model, qc.vw_change_in_ahsap_values
,
which we filter for a specific township name (like "Hyde Park") and tax year during export.
Here's an example of how to export that model for a township name defined by $TOWNSHIP_NAME
and a tax year defined by $TAXYR
:
python3 scripts/export_models.py --select qc.vw_change_in_ahsap_values --where "taxyr = '$TAXYR' and township_name = '$TOWNSHIP_NAME'"
The script will output the reports to the dbt/export/output/
directory, and will print the
name of the report that it exports during execution.
Since QC reports are built on top of models, adding a new QC report can be as simple
as adding a new model and exporting it using the export_models
script.
You should default to adding your model to the qc
schema and subdirectory, unless there is
a good reason to define it elsewhere. For details on how to add a model, see
➕ How to add a new model.
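As a sketch, a QC report model is an ordinary SQL model in models/qc/ that selects the suspicious rows plus the fields used for filtering during export. Everything below is hypothetical, loosely following the change-in-value example mentioned earlier:

```sql
-- models/qc/vw_qc_report_some_new_report.sql (hypothetical)
select
    parid,
    taxyr,
    township_code,
    -- Surface the suspicious quantity so reviewers can sort on it
    curr_year_value - prev_year_value as value_change
from {{ ref('some_upstream_model') }}
where curr_year_value - prev_year_value > 500000
```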
There are a number of configuration options that allow you to control the format of your model during export:
- Tagging: Model tags are not required for QC reports, but they are helpful in cases where we need to export more than one report at a time. For example, Valuations typically requests all of the town close QC reports at the same time, so we tag each model with the qc_report_town_close tag such that we can select them all at once when running the export_models script using --select tag:qc_report_town_close. For consistency, prefer tags that start with the qc_report_* prefix, but be careful not to use the test_qc_* prefix, which is instead used for QC tests.
- Filtering: Since the export_models script can filter your model using the --where option, you should define your model such that it selects any fields that you want to use for filtering in the SELECT clause. It's common to filter reports by taxyr and one of either township_name or township_code.
- Formatting: You can set a few different optional configs on the meta attribute of your model's schema definition in order to control the format of the output workbook:
  - meta.export_name: The base name that the script will use for the output file, not including the file extension. The script will output the file to dbt/export/output/{meta.export_name}.xlsx. If unset, defaults to the name of the model.
  - meta.export_template: The base name for an Excel file that the script will use as a template to populate with data, not including the file extension. The script will read this file from dbt/export/templates/{meta.export_template}.xlsx. Templates are useful if you want to apply custom headers, column widths, or other column formatting to the output that are not otherwise configurable by the meta.export_format config attribute described below. If unset, the script will search for a template with the same name as the model; if it does not find a template, it will default to a simple layout with filterable columns and striped rows.
  - meta.export_format: An object with the following schema that controls the format of the output workbook:
    - columns (required): A list of one or more columns to format, each of which should be an object with the following schema:
      - index (required): The letter index of the column to be formatted, like A or AB.
      - name (optional): The name of the column as it appears in the header of the workbook. The script does not use this attribute and instead uses index, but we set it in order to make the column config object more readable.
      - horizontal_align (optional): The horizontal alignment to set on the column, one of left or right.
Here's an example of a model schema definition that sets all of the different optional and required formatting options for a new QC report:
```yaml
models:
  - name: qc.vw_qc_report_new
    description: '{{ doc("view_vw_qc_report_new") }}'
    config:
      tags:
        - qc_report_new
      meta:
        export_name: QC Report (New)
        export_template: qc_report_new
        export_format:
          columns:
            - index: B
              name: Class
              horizontal_align: left
```
In the case of this model, the export_models
script:
- Will export the model if either --select qc.vw_qc_report_new or --select tag:qc_report_new is set
- Will use the template dbt/export/templates/qc_report_new.xlsx to populate data
- Will export the output workbook to dbt/export/output/QC Report (New).xlsx
- Will left-align column B, a column with the name Class
Most of our dbt tests are simple SQL statements that we run against our models in order to confirm that models conform to spec. If a test is failing, you can run or edit the underlying query in order to investigate the failure and determine whether the root cause is a code change we made, new data that was pushed to the system of record, or a misunderstanding about the data specification.
To edit or run the query underlying a test, first run the test in isolation:
dbt test --select <test_name>
Then, navigate to the Recent
queries
tab in Athena. Your test will likely be one of the most recent queries; it
will also start with the string -- /* {"app": "dbt", ...
, which can be
helpful for spotting it in the list of recent queries.
Open the query in the Athena query editor, and edit or run it as necessary to debug the test failure.
To quickly rule out a failure related to a code change, you can switch to the
main branch of this repository (or to an earlier commit where we know tests
passed, if tests are failing on the main branch) and rerun the test against prod
using the --target prod
option. If the test continues to fail in the same
fashion, then we can be confident that the root cause is the data and not the
code change.
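In practice, that check looks something like this (substitute the failing test's name):

```bash
# From the dbt directory, with the virtualenv activated
git switch main   # or check out an earlier known-good commit
dbt test --select <test_name> --target prod
```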
The cleanup-dbt-resources
workflow removes all AWS resources that were created
by GitHub Actions for a pull request once that pull request has been merged into
the main branch of the repository. On rare occasions, this workflow might fail
due to changes to our dbt setup that invalidate the assumptions of the workflow.
There are two ways to clean up a PR's resources manually:
- Using the AWS console: Log in to the AWS console and navigate to the Athena homepage. Select Data sources in the sidebar. Click on the AwsDataCatalog resource. In the Associated databases table, select each database that matches the database pattern for your pull request (i.e. prefixed with z_ci_ plus the name of your branch) and click the Delete button in the top right-hand corner of the table.
- Using the command-line: If the workflow has failed, it most likely means there is a bug in the .github/scripts/cleanup_dbt_resources.sh script (source code). Once you've identified and fixed the bug, confirm it works by running the following command to clean up the resources created by the pull request:
HEAD_REF=$(git branch --show-current) ../.github/scripts/cleanup_dbt_resources.sh ci
If you get this error:
Compilation Error
dbt found two schema.yml entries for the same resource named location.vw_pin10_location. Resources and their associated columns may only be described a single time. To fix this, remove one of the resource entries for location.vw_pin10_location in this file:
- models/location/schema.yml
It usually means that dbt's state has unresolvable conflicts with the current
state of your working directory. To resolve this, run dbt clean
to clear your
dbt state, reinstall dbt dependencies with dbt deps
, and then try rerunning
the command that raised the error.
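In other words:

```bash
dbt clean   # clears dbt's local state (e.g. the target/ and dbt_packages/ directories)
dbt deps    # reinstalls dbt package dependencies
# ...then rerun the command that raised the error
```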
When attempting to build models, you may occasionally run into the following error indicating that a model that your selected model depends on does not exist:
Runtime Error in model default.vw_card_res_char (models/default/default.vw_card_res_char.sql)
line 24:10: Table 'awsdatacatalog.z_dev_jecochr_default.vw_pin_land' does not exist
The error may look like this if an entire schema is missing:
Runtime Error in model default.vw_pin_universe (models/default/default.vw_pin_universe.sql)
line 130:11: Schema 'z_dev_jecochr_location' does not exist
To resolve this error, you can prefix your selected model's name with a plus
sign (+
) to instruct dbt to (re)build its dependency models as well. However,
note that this will rebuild all dependency models, even ones that already
exist in your development environment, so if your model depends on another model
that is compute-intensive (basically, anything in the location
or proximity
schemas) you should use the --exclude
option to exclude
these compute-intensive models from being rebuilt:
dbt build --select +model.vw_pin_shared_input --exclude location.* proximity.* --resource-types model seed
If you'd like to know what dbt is doing under the hood, you can use the --log-level
parameter to enable debug
logging when running dbt commands:
dbt --log-level debug build --select model.vw_pin_shared_input