1. Introduction to dbt Models
- dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
- Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
- Approach: Modular data modeling using fact tables and dimensional tables.
-
Process:
- Start with raw data (sources).
- Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
- Apply business logic to create fact and dimensional tables.
- Aggregate data for stakeholders.
2. Anatomy of a dbt Model
-
Models are SQL files that contain
SELECT
statements. -
Materialization: Determines how the model is stored in the database.
-
Types of Materialization:
- Ephemeral: Not physically stored; exists only within other models (like a CTE).
- View: Materialized as a view in the database.
- Table: Materialized as a physical table.
- Incremental: Updates only new data in the table (avoids recreating the entire table).
-
Types of Materialization:
-
File Structure:
- Each model is a
.sql
file. - Example:
model_name.sql
contains aSELECT
statement. - dbt compiles the SQL into DDL/DML (e.g.,
CREATE TABLE AS SELECT
).
- Each model is a
3. Defining Sources
- Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).
-
YAML Configuration:
- Define sources in a
schema.yml
file. - Specify database, schema, and table names.
-
Example:
version: 2 sources: - name: taxi_trips database: my_database schema: trip_data tables: - name: green_trip_data - name: yellow_trip_data
- Define sources in a
-
Benefits:
- Abstracts the complexity of source locations.
- Allows for freshness testing to ensure data is up-to-date.
4. Using Seeds
-
Seeds: CSV files stored in the dbt repository (e.g.,
taxi_zone_lookup.csv
). - Purpose: Used for small, static datasets (e.g., master data tables).
-
Benefits:
- Version-controlled.
- Can be documented and tested like other dbt models.
-
Example:
- Load
taxi_zone_lookup.csv
as a seed. - Use it in transformations to enrich trip data with zone information.
- Load
5. Modular Data Modeling
-
Layers:
-
Staging: Clean and prepare raw data.
- Example:
stg_green_trip_data.sql
andstg_yellow_trip_data.sql
.
- Example:
-
Core: Create fact and dimensional tables.
- Example:
fact_trips.sql
anddim_zones.sql
.
- Example:
-
Staging: Clean and prepare raw data.
-
Ref Function:
- Use
ref('model_name')
to reference other dbt models. - Automatically builds dependencies between models.
-
Example:
SELECT * FROM {{ ref('stg_green_trip_data') }}
- Use
6. Macros in dbt
- Macros: Reusable SQL code snippets (similar to functions in Python).
- Purpose: Simplify repetitive tasks and dynamically generate SQL.
-
Example:
-
Macro to generate payment type descriptions:
{% macro get_payment_type_description(payment_type) %} CASE CAST({{ payment_type }} AS INTEGER) WHEN 1 THEN 'Credit Card' WHEN 2 THEN 'Cash' ELSE 'Unknown' END {% endmacro %}
-
-
Usage:
-
Call the macro in a model:
SELECT {{ get_payment_type_description('payment_type') }} AS payment_description
-
7. Packages in dbt
- Packages: Libraries of reusable macros and models.
-
Example:
dbt_utils
package provides common utilities (e.g.,surrogate_key
). -
Installation:
-
Add to
packages.yml
:
packages: - package: dbt-labs/dbt_utils version: 0.8.0
Run
dbt deps
to install.
-
-
Usage:
-
Example: Generate a surrogate key:
SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key
-
8. Variables in dbt
- Variables: Dynamic values that can be defined at the project level or passed during execution.
-
Usage:
-
Define in
dbt_project.yml
:
vars: payment_type_values: [1, 2, 3, 4, 5, 6]
-
Use in a model:
{% if var('is_test_run', true) %} LIMIT 100 {% endif %}
-
-
Execution:
-
Pass variables via CLI:
dbt run --vars '{"is_test_run": false}'
-
9. Building Fact and Dimensional Tables
-
Fact Table: Aggregates data from multiple sources.
- Example:
fact_trips.sql
combines green and yellow trip data. - Union both datasets and join with
dim_zones
for zone information.
- Example:
-
Dimensional Table: Provides context for fact tables.
- Example:
dim_zones.sql
contains zone details (e.g., borough, service zone).
- Example:
10. Testing and Deployment
-
Testing:
- Use dbt's built-in tests (e.g., uniqueness, not null).
-
Example:
tests: - unique - not_null
-
Deployment:
- Run
dbt run
to compile and execute models. - Use
dbt build
to run models and tests together. -
Example:
dbt build --vars '{"is_test_run": false}'
-
Best Practice to implement
dev limit
which can be toggle off when in production by simply settingdefault=false
:
-- dbt build --select <model_name> --vars '{'is_test_run': 'false'}' {% if var('is_test_run', default=true) %} limit 100 {% endif %}
→ faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology.
- Run
11. Lineage and Dependency Management
- Lineage: Visual representation of dependencies between models.
-
Benefits:
- Understand how data flows through the pipeline.
- Identify upstream and downstream impacts of changes.
-
Example:
-
fact_trips
depends onstg_green_trip_data
,stg_yellow_trip_data
, anddim_zones
.
-
12. Best Practices
- Modularity: Break down transformations into reusable models.
- Documentation: Use YAML files to document sources, models, and tests.
- Version Control: Store dbt projects in Git for collaboration and history.
- Performance: Materialize large tables as tables (not views) for better performance.
13. Key Commands
-
dbt run
: Compiles and runs models. -
dbt test
: Runs tests on models. -
dbt build
: Runs models and tests together. -
dbt deps
: Installs packages. -
dbt compile
: Compiles SQL without executing it.
14. Example Project Structure
dbt_project/
├── models/
│ ├── staging/
│ │ ├── stg_green_trip_data.sql
│ │ ├── stg_yellow_trip_data.sql
│ ├── core/
│ │ ├── fact_trips.sql
│ │ ├── dim_zones.sql
├── seeds/
│ ├── taxi_zone_lookup.csv
├── macros/
│ ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml
Top comments (0)