DEV Community

Chetan Gupta
Chetan Gupta

Posted on • Edited on

An End-to-End Guide to dbt (Data Build Tool) with a Use Case Example

Introduction

In modern data engineering, creating scalable and manageable data pipelines is a critical aspect of maintaining clean, organized, and reliable data in data warehouses. dbt (Data Build Tool) has emerged as a popular open-source tool for data transformation, enabling data teams to apply software engineering best practices like version control, testing, and modularity in SQL transformations. This blog will take you through an end-to-end guide to dbt, walking through key concepts with a use case example to help you understand how it works in practice.

What is dbt?

dbt (Data Build Tool) allows data analysts and engineers to transform data within a warehouse by writing SQL queries. It enables you to model your data, run tests, and create documentation. dbt can transform raw data into organized, actionable analytics datasets using SQL and Jinja (a templating language).

Core Features of dbt:

  • Modularity: Break down SQL code into reusable pieces.
  • Version Control: Use git to manage changes in your dbt project.
  • Testing: Automated testing for data quality.
  • Documentation: Automatically generate documentation from your models.
  • Dependency Management: Automatically order transformations with Directed Acyclic Graph (DAG).
  • dbt Architecture Overview

Here’s a simplified flow:

  • Data Extraction happens elsewhere (ETL tools like Airbyte or Fivetran).
  • Data Loading: Raw data lands in your data warehouse (e.g., Snowflake, BigQuery, Redshift).
  • dbt comes into play for data transformation, where it organizes, cleans, and models data.
  • The transformed data is used for reporting or analysis.
  • Use Case: Sales Analytics Pipeline

Imagine you're building a data pipeline to report on an e-commerce company’s sales performance. The raw data is available in the following format:

  • Customers table
  • Orders table
  • Products table Our goal is to use dbt to transform this raw data into a Sales Report table that aggregates revenue per product and customer, with historical accuracy and data quality tests.

Setting Up Your dbt Project

Let’s walk through creating and managing a dbt project from scratch.

  1. Install dbt First, install dbt. This is done via Python’s package manager pip.

bash
pip install dbt
Once installed, you can create a new dbt project using the command:

bash
dbt init sales_analytics_project
Navigate to the new project directory:

bash
cd sales_analytics_project

  1. Configure dbt Profiles dbt needs to connect to your data warehouse (e.g., Snowflake, BigQuery, Redshift). Inside your ~/.dbt/profiles.yml, configure the connection settings.

Example for Snowflake:

yaml

sales_analytics_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_snowflake_account
      user: your_username
      password: your_password
      role: your_role
      database: your_database
      warehouse: your_warehouse
      schema: analytics
      threads: 4
Enter fullscreen mode Exit fullscreen mode
  1. Organize Your dbt Models In dbt, models are SQL files that represent transformations. Let’s build our Sales Report model step by step.

a. Create Models Directory

Inside your dbt project, you will have a models folder. Let’s create three models:

Customers model: customers.sql
Orders model: orders.sql
Products model: products.sql
Each model will transform the raw data, and you can reference raw tables using ref().

b. Write SQL Transformations

Customers Model:
sql

-- models/customers.sql
WITH customers_cleaned AS (
    SELECT
        customer_id,
        first_name,
        last_name,
        email,
        created_at
    FROM {{ ref('raw_customers') }}
)

SELECT * FROM customers_cleaned;
Enter fullscreen mode Exit fullscreen mode

Orders Model:
sql

-- models/orders.sql
WITH orders_cleaned AS (
    SELECT
        order_id,
        customer_id,
        product_id,
        total_amount,
        order_date
    FROM {{ ref('raw_orders') }}
)

SELECT * FROM orders_cleaned;
Enter fullscreen mode Exit fullscreen mode

Products Model:
sql

-- models/products.sql
WITH products_cleaned AS (
    SELECT
        product_id,
        product_name,
        category,
        price
    FROM {{ ref('raw_products') }}
)

SELECT * FROM products_cleaned;
Enter fullscreen mode Exit fullscreen mode

c. Create Sales Report Model

Now, we’ll create a final Sales Report model by joining the above models.

sql

-- models/sales_report.sql
WITH sales AS (
    SELECT
        o.order_id,
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        p.product_name,
        p.category,
        o.total_amount,
        o.order_date
    FROM {{ ref('orders') }} o
    JOIN {{ ref('customers') }} c
        ON o.customer_id = c.customer_id
    JOIN {{ ref('products') }} p
        ON o.product_id = p.product_id
)

SELECT
    product_name,
    category,
    SUM(total_amount) AS total_revenue,
    COUNT(order_id) AS total_orders
FROM sales
GROUP BY 1, 2
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode
  1. Testing Your dbt Models Testing is crucial to maintain data quality. dbt allows you to write tests to validate data.

a. Write Basic Tests

Create a new file called schema.yml in the models folder:

yaml

version: 2

models:
  - name: customers
    tests:
      - unique:
          column: customer_id
      - not_null:
          column: email

  - name: orders
    tests:
      - unique:
          column: order_id
      - not_null:
          column: total_amount
Enter fullscreen mode Exit fullscreen mode

These tests ensure that:

customer_id and order_id are unique.
email and total_amount are not null.
b. Run Tests

To run the tests, use the command:

bash
dbt test
If any tests fail, dbt will notify you and provide details.

  1. Snapshots for Slowly Changing Dimensions (SCD) Let’s implement a snapshot to track changes in the customer information over time (Slowly Changing Dimension Type 2).

a. Create a Snapshot

Inside the snapshots folder, create customers_snapshot.sql:

sql

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

    {{ config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['first_name', 'last_name', 'email']
    ) }}

    SELECT * FROM {{ ref('raw_customers') }}

{% endsnapshot %}
Enter fullscreen mode Exit fullscreen mode

This snapshot will track changes to the first_name, last_name, and email columns for each customer.

b. Run Snapshots

To capture the current state, use the command:

bash
dbt snapshot
Over time, this snapshot will record changes, allowing you to maintain a historical view of customer data.

  1. Documentation in dbt dbt can automatically generate documentation for your models, tests, and lineage. To generate and serve documentation:

bash

dbt docs generate
dbt docs serve
Enter fullscreen mode Exit fullscreen mode

You’ll get an interactive website with details on your models, their columns, and dependencies.

  1. Run dbt Models Finally, run your dbt project to materialize the transformations and load the transformed data into your data warehouse.

bash
dbt run
This command will execute all models in the project, following the dependencies defined by the DAG (Directed Acyclic Graph).

Conclusion

This end-to-end guide provides an overview of setting up and using dbt in a real-world use case. We covered:

Project setup and configuration.
Writing transformations using models.
Running data tests to ensure quality.
Creating snapshots for tracking changes over time.
Generating documentation for better data lineage visibility.
By using dbt, you can simplify your data transformation workflows, ensure data quality, and build more maintainable, modular pipelines. With the sales analytics example, you now have a good foundation to explore other advanced dbt features like macros, incremental models, and CI/CD integration.

Happy Transforming!

Top comments (0)