1. Introduction to Data Testing
- Objective: Ensure data delivered to end-users is accurate and prevent building models on incorrect data.
-
DVT Tests: Data Validation Tests (DVT) are assumptions about data. They are SQL-based checks that validate data integrity.
- Purpose: Identify errors quickly and stop execution if assumptions fail.
- Implementation: Defined in YAML files and compiled into SQL code.
2. Types of DVT Tests
-
Four Default Tests:
- Unique Test: Ensures a field's uniqueness within a data model.
- Not Null Test: Ensures a field does not contain null values.
- Accepted Values Test: Validates that a field contains only specified values.
- Foreign Key Test: Ensures a field has a valid relationship with another table.
-
Example:
- Payment Type Description: Must be one of the values 1, 2, 3, 4, or 5.
-
Pickup Location: Must relate to a valid taxi zone in the
ref_taxi_lookup
table. - Trip ID: Must be unique and not null (primary key).
3. Compiling and Running Tests
- YAML File: Contains test definitions.
-
SQL Compilation: Tests are translated into SQL queries.
- Example:
SELECT * FROM production.my_schema.yellow_trip_data WHERE trip_id IS NULL
.
- Example:
- Test Execution: If no values are returned, the test passes. If values are returned, the test fails or issues a warning.
4. Macros in DVT
-
Cross-Database Macros: Abstract SQL flavor differences across databases (e.g., BigQuery vs. PostgreSQL).
- Example:
DATE_TRUNC
andSAFE_CAST
macros adapt to the underlying database.
- Example:
- Usage: Simplifies SQL code and ensures compatibility across databases.
5. Code Generation with Kosan Package
- Purpose: Automates the generation of YAML files and base models.
-
Features:
- Generate Source: Creates YAML files for database sources.
- Generate Base Model: Creates CTE-based models.
- Generate Model YAML: Automates YAML file creation for models.
-
Usage:
- Add the Kosan package to the project.
- Use functions like
generate_model_yaml
to create YAML files for multiple models.
6. Adding Tests to Models
-
Steps:
- Define unique and not null tests for primary keys (e.g.,
trip_id
). - Add relationship tests for foreign keys (e.g.,
pickup_location_id
must relate toref_taxi_lookup
). - Add accepted values tests for specific fields (e.g.,
payment_type
must be 1, 2, 3, 4, or 5).
- Define unique and not null tests for primary keys (e.g.,
-
Severity Levels:
- Warning: Issues a warning but does not stop execution.
- Error: Stops execution if the test fails.
- Threshold: Define thresholds for warnings (e.g., up to 10 records can fail).
7. Building and Testing the Project
-
Process:
- Build the project, starting with sources and seeds.
- Execute tests for each model as it is built.
- Parallel execution: Independent models are built simultaneously.
-
Test Results:
- Warnings or errors are displayed for failed tests.
- Example: A warning is issued if
payment_type
contains values outside the accepted range.
8. Documentation in DVT
- Purpose: Create comprehensive documentation for the project.
-
Sources of Documentation:
- YAML Files: Descriptions of models and columns.
- Code: Lineage and dependencies extracted from the code.
- Information Schema: Metadata like table size.
-
Generating Documentation:
- Use
dbt docs generate
to create documentation files. - Use
dbt docs serve
to host documentation locally.
- Use
-
Documentation Features:
- Lineage diagrams.
- Test results and descriptions.
- Compiled SQL code and dependencies.
9. Hosting Documentation
-
Locally: Use
dbt docs serve
to host documentation on localhost. - Cloud: Documentation can be hosted on dbt Cloud.
- Example: Documentation includes model descriptions, column details, and test results.
10. Additional Tools and Packages
- dbt Utils: Provides additional utilities for testing and documentation.
- dbt Expectations: Based on Great Expectations, offers advanced testing capabilities.
11. Key Takeaways
- Testing: Essential for ensuring data accuracy and integrity.
- Automation: Use tools like Kosan to automate YAML and model generation.
- Documentation: Critical for maintaining transparency and understanding of the project.
- Flexibility: DVT macros and packages allow for cross-database compatibility and advanced testing.
Example YAML Snippet for Tests
version: 2
models:
- name: yellow_trip_data
columns:
- name: trip_id
tests:
- unique
- not_null
- name: pickup_location_id
tests:
- relationships:
to: ref_taxi_lookup
field: location_id
- name: payment_type
tests:
- accepted_values:
values: [1, 2, 3, 4, 5]
Commands for Documentation
# Generate documentation
dbt docs generate
# Serve documentation locally
dbt docs serve
By following these steps and utilizing the tools discussed, you can ensure your data models are robust, well-tested, and thoroughly documented.
Top comments (0)