Generating Base Tables in dbt
With dbt, source tables are pulled into a project through staging models with little to no transformation. Staging models simply clean and stage your data for use downstream in other models: generic id columns can be renamed to something more descriptive, and unnecessary columns can be removed. Joins rarely happen at this stage. As a result, the work done to bring a source table into a staging model is typically uniform, and calling out each column you need from a source table can be tedious at best. But this is dbt, so there is a way to use code to make things easier!
The dbt-codegen package has a macro for handling this exact thing, generate_base_model. It has optional arguments for leading commas (vs. trailing commas), case sensitivity, and materialization. This is a great starting place.
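With the package installed, the macro can be invoked with dbt run-operation. The source and table names below are hypothetical; the argument names follow the dbt-codegen documentation:

```shell
# Generate a base model for a hypothetical `orders` table in an `estuary` source
dbt run-operation generate_base_model \
  --args '{"source_name": "estuary", "table_name": "orders", "leading_commas": true}'
```

The macro prints the generated SQL to the terminal, ready to paste into a new staging model file.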
But each project has its own nuances. The default output is not going to be perfect and will still need modifying. Instead of editing the output of generate_base_model every time, it would be great if it already included the modifications we know we will need in every staging model we create. You could dig through the files installed with the package to find the macro, but an easier way is to create your own macro with the same name in your project's macros folder. dbt resolves macros in your own project before those in packages, so running the command will use your version instead of the package default. This is where you can tailor the macro to your project. The original source code for the generate_base_model macro can be found in the dbt-codegen repository.
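A minimal sketch of the override: a file in your macros folder that reuses the package macro's name and signature (the signature here mirrors the dbt-codegen source, so double-check it against the version you have installed):

```jinja
{# macros/generate_base_model.sql — same name as the package macro, so this version takes precedence #}
{% macro generate_base_model(source_name, table_name, leading_commas=False, case_sensitive_cols=False, materialized=None) %}
    {#- Start from a copy of the dbt-codegen source, then add project-specific tweaks below -#}
    ...
{% endmacro %}
```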
In our project, we are bringing tables over into our database with Estuary Flow. Estuary Flow adds columns with important data to the tables that it brings over, but we do not want to carry these columns through into our staging models. Our modified macro excludes these columns.
{%- set columns = adapter.get_columns_in_relation(source_relation) -%}
{%- set estuary_columns = ['_meta/op','flow_published_at','flow_document'] -%}
{%- set all_column_names = columns | map(attribute='name') -%}
{%- set base_model_sql -%}
{%- set column_names = [] -%}
{#- Removes the columns that Estuary adds from the column list.
We do not need them once the staging model has filtered out deleted rows -#}
{%- for col in all_column_names -%}
{%- if col not in estuary_columns -%}
{%- do column_names.append(col) -%}
{%- endif %}
{%- endfor %}
Our source tables include created_at and updated_at timestamps. While these columns are useful to keep in our staging models, they do not need to be displayed prominently. Estuary Flow does not maintain column order, so these columns can end up sorted alphabetically among the rest. We prefer that they show up last and next to each other.
{#- Moves created_at and updated_at to the end of the column list -#}
{%- if 'created_at' in column_names -%}
{%- do column_names.remove('created_at') -%}
{%- do column_names.append('created_at') -%}
{%- endif %}
{%- if 'updated_at' in column_names -%}
{%- do column_names.remove('updated_at') -%}
{%- do column_names.append('updated_at') -%}
{%- endif -%}
Estuary Flow also adds the _meta/op column, which we use to filter out deletions. We do not want to include any deleted records in our staging models. Since this column is only used for filtering, it is one of the columns we exclude from being displayed.
FROM source
WHERE
"_meta/op" <> 'd'
Another modification made is to rename id columns to be more descriptive.
{%- for column in column_names %}
    {%- if column == 'id' -%}
    {# Takes the table name, strips the last letter to make it singular (e.g. orders --> order)
       and appends "_id" to create a primary key that matches the name of foreign keys #}
    id AS {{ table_name[:-1] }}_id{{ "," if not loop.last }}
    {%- else -%}
    {#- All other columns pass through unchanged -#}
    {{ column }}{{ "," if not loop.last }}
    {%- endif -%}
{%- endfor %}
After creating our own macro for generating base models, the output now matches what we want in our staging models and rarely needs any manual modification.
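For a hypothetical orders table, the generated staging model comes out roughly like this (column names are illustrative):

```sql
WITH source AS (
    SELECT * FROM {{ source('estuary', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        status,
        created_at,
        updated_at
    FROM source
    WHERE
        "_meta/op" <> 'd'
)

SELECT * FROM renamed
```

The Estuary metadata columns are gone, deleted records are filtered out, the id column has been renamed, and the timestamps sit together at the end of the select list.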
Being able to automatically generate the contents of the staging models in your project reduces the time you spend creating them and speeds up project setup.