DEV Community

Cover image for Study Notes 4.3.1 - Build the First dbt Models
Pizofreude
Pizofreude

Posted on

Study Notes 4.3.1 - Build the First dbt Models

1. Introduction to dbt Models

intro to dbt models

  • dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
  • Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
  • Approach: Modular data modeling using fact tables and dimensional tables.

modular data modeling theory

  • Process:
    • Start with raw data (sources).
    • Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
    • Apply business logic to create fact and dimensional tables.
    • Aggregate data for stakeholders.

2. Anatomy of a dbt Model

  • Models are SQL files that contain SELECT statements.
  • Materialization: Determines how the model is stored in the database.
    • Types of Materialization:
      1. Ephemeral: Not physically stored; exists only within other models (like a CTE).
      2. View: Materialized as a view in the database.
      3. Table: Materialized as a physical table.
      4. Incremental: Updates only new data in the table (avoids recreating the entire table).

materializations in dbt cloud

  • File Structure:
    • Each model is a .sql file.
    • Example: model_name.sql contains a SELECT statement.
    • dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT).

3. Defining Sources

  • Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).

defining sources

  • YAML Configuration:

    • Define sources in a schema.yml file.
    • Specify database, schema, and table names.
    • Example:

      version: 2
      sources:
        - name: taxi_trips
          database: my_database
          schema: trip_data
          tables:
            - name: green_trip_data
            - name: yellow_trip_data
      
      
  • Benefits:

    • Abstracts the complexity of source locations.
    • Allows for freshness testing to ensure data is up-to-date.

4. Using Seeds

  • Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv).
  • Purpose: Used for small, static datasets (e.g., master data tables).
  • Benefits:
    • Version-controlled.
    • Can be documented and tested like other dbt models.
  • Example:
    • Load taxi_zone_lookup.csv as a seed.
    • Use it in transformations to enrich trip data with zone information.

5. Modular Data Modeling

  • Layers:
    1. Staging: Clean and prepare raw data.
      • Example: stg_green_trip_data.sql and stg_yellow_trip_data.sql.
    2. Core: Create fact and dimensional tables.
      • Example: fact_trips.sql and dim_zones.sql.

Image description

  • Ref Function:

    • Use ref('model_name') to reference other dbt models.
    • Automatically builds dependencies between models.
    • Example:

      SELECT * FROM {{ ref('stg_green_trip_data') }}
      
      

6. Macros in dbt

  • Macros: Reusable SQL code snippets (similar to functions in Python).
  • Purpose: Simplify repetitive tasks and dynamically generate SQL.
  • Example:

    • Macro to generate payment type descriptions:

      {% macro get_payment_type_description(payment_type) %}
          CASE CAST({{ payment_type }} AS INTEGER)
              WHEN 1 THEN 'Credit Card'
              WHEN 2 THEN 'Cash'
              ELSE 'Unknown'
          END
      {% endmacro %}
      
      
  • Usage:

    • Call the macro in a model:

      SELECT {{ get_payment_type_description('payment_type') }} AS payment_description
      
      

7. Packages in dbt

  • Packages: Libraries of reusable macros and models.
  • Example: dbt_utils package provides common utilities (e.g., surrogate_key).
  • Installation:

    • Add to packages.yml:

      packages:
        - package: dbt-labs/dbt_utils
          version: 0.8.0
      
      
    • Run dbt deps to install.

  • Usage:

    • Example: Generate a surrogate key:

      SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key
      
      

8. Variables in dbt

  • Variables: Dynamic values that can be defined at the project level or passed during execution.
  • Usage:

    • Define in dbt_project.yml:

      vars:
        payment_type_values: [1, 2, 3, 4, 5, 6]
      
      
    • Use in a model:

      {% if var('is_test_run', true) %}
          LIMIT 100
      {% endif %}
      
      
  • Execution:

    • Pass variables via CLI:

      dbt run --vars '{"is_test_run": false}'
      
      

9. Building Fact and Dimensional Tables

  • Fact Table: Aggregates data from multiple sources.
    • Example: fact_trips.sql combines green and yellow trip data.
    • Union both datasets and join with dim_zones for zone information.
  • Dimensional Table: Provides context for fact tables.
    • Example: dim_zones.sql contains zone details (e.g., borough, service zone).

10. Testing and Deployment

  • Testing:

    • Use dbt's built-in tests (e.g., uniqueness, not null).
    • Example:

      tests:
        - unique
        - not_null
      
      
  • Deployment:

    • Run dbt run to compile and execute models.
    • Use dbt build to run models and tests together.
    • Example:

      dbt build --vars '{"is_test_run": false}'
      
      
    • Best Practice to implement dev limit which can be toggle off when in production by simply setting default=false:

      -- dbt build --select <model_name> --vars '{'is_test_run': 'false'}'
      {% if var('is_test_run', default=true) %}
      
        limit 100
      
      {% endif %}
      

      → faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology.


11. Lineage and Dependency Management

  • Lineage: Visual representation of dependencies between models.
  • Benefits:
    • Understand how data flows through the pipeline.
    • Identify upstream and downstream impacts of changes.
  • Example:
    • fact_trips depends on stg_green_trip_data, stg_yellow_trip_data, and dim_zones.

modular data modeling nyc taxi practical


12. Best Practices

  • Modularity: Break down transformations into reusable models.
  • Documentation: Use YAML files to document sources, models, and tests.
  • Version Control: Store dbt projects in Git for collaboration and history.
  • Performance: Materialize large tables as tables (not views) for better performance.

13. Key Commands

  • dbt run: Compiles and runs models.
  • dbt test: Runs tests on models.
  • dbt build: Runs models and tests together.
  • dbt deps: Installs packages.
  • dbt compile: Compiles SQL without executing it.

14. Example Project Structure

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_green_trip_data.sql
│   │   ├── stg_yellow_trip_data.sql
│   ├── core/
│   │   ├── fact_trips.sql
│   │   ├── dim_zones.sql
├── seeds/
│   ├── taxi_zone_lookup.csv
├── macros/
│   ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml

Enter fullscreen mode Exit fullscreen mode

Top comments (0)