Introduction
In today's data-driven world, transforming raw data into valuable insights is crucial. This process, however, often involves complex tasks that demand efficiency, scalability, and reliability. Enter dbt Cloud—a powerful tool that simplifies data transformations on Google BigQuery. In this article, I'll take you through a step-by-step guide on how to run BigQuery transformations using dbt Cloud. Let's dive in!
Prerequisites
A dbt Cloud account
A GCP account
For BigQuery, ensure you have the following:
- A service account with the required elevated roles and its keys.json key file
- The BigQuery API enabled
- Two BigQuery datasets: one for staging and one for production
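If you haven't set these up yet, the Google Cloud SDK's command-line tools are a quick way to do it. Here's a minimal sketch, assuming your_project is a placeholder for your actual GCP project ID and using the dataset names from the rest of this guide:

# enable the BigQuery API (if it isn't already enabled)
gcloud services enable bigquery.googleapis.com
# create the staging and production datasets
bq mk --dataset your_project:mentalhealth_staging
bq mk --dataset your_project:mentalhealth_prod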
Step 1: Setting Up Your dbt Cloud Project
Start by signing in to dbt Cloud and creating a new project.
- Create a project and define the connection to BigQuery. dbt requires you to upload the keys.json file that contains your BigQuery credentials. Define your target dataset; this is the destination for your transformed data during development, in this case mentalhealth_staging. You can choose a managed repository or clone a repository from GitHub. (For reference, a sketch of the equivalent connection settings follows this list.)
- Initialize your project. dbt will create starter files and folders for you.
- By default, dbt Cloud only lets you develop on a branch, so create one or more branches to use for development and deployment.
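dbt Cloud manages the BigQuery connection for you through the UI, so you don't write this file yourself, but for reference a local dbt setup would express the same connection in a profiles.yml roughly like this (the project ID and key file path are placeholders):

default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: your-gcp-project        # placeholder for your GCP project ID
      dataset: mentalhealth_staging    # target dataset for development
      keyfile: /path/to/keys.json      # the service account key you uploaded
      threads: 4

The profile name (default) matches the profile setting in the dbt_project.yml shown below.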
Step 2: Creating Transformations
In dbt Cloud, transformations are defined as dbt models. A dbt model is a SQL file containing the transformation logic. Write your SQL queries to transform and reshape your data as needed. These models can join tables, aggregate data, and create calculated fields.
- Define the dbt_project.yml. The dbt_project.yml file is the configuration file used in dbt (data build tool) projects. It's a YAML (YAML Ain't Markup Language) file that lets you define project settings and configurations in a structured format. This file serves as the central configuration hub for your dbt project, controlling how dbt behaves when executing data transformations.
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'bigquery_proj'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
profile: 'default'
# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"
# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
# In this example config, we tell dbt to build all models in the example/ directory
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
  bigquery_proj:
    staging:
      materialized: table
    core:
      materialized: table
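Because the models: block above configures models by folder, it assumes a project layout along these lines (the file names match the models created in the rest of this guide):

models/
├── staging/
│   ├── extract_mentalhealthdata.sql
│   └── schema.yml
└── core/
    ├── employee_dim.sql
    └── schema.yml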
- Next, we define dbt models for our transformations. We'll create staging models for development and core models for deployment, and we'll add a macro to enhance our models. Macros are reusable SQL code snippets that let you encapsulate and parameterize common SQL operations or patterns. Macros in dbt abstract away and simplify complex SQL logic, making your models more modular, maintainable, and efficient.
{#
    This macro maps raw gender values into three categories
#}
{% macro get_gender_properties(gender) -%}
    -- lower-case the input so values like 'Male' and 'M' are matched too
    case lower({{ gender }})
        when 'male' then 'male'
        when 'm' then 'male'
        when 'female' then 'female'
        when 'f' then 'female'
        else 'others'
    end
{%- endmacro %}
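When dbt compiles a model that calls this macro, the Jinja expands inline. For example, a call like {{ get_gender_properties('Gender') }} compiles to roughly this SQL:

case lower(Gender)
    when 'male' then 'male'
    when 'm' then 'male'
    when 'female' then 'female'
    when 'f' then 'female'
    else 'others'
end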
1. Staging models
extract_mentalhealthdata.sql
{{ config(
materialized='table'
) }}
select
    -- identifier: generate_uuid() yields a truly unique value per row (rand() would not)
    generate_uuid() as unique_id,
    -- use the macro to convert gender values
    {{ get_gender_properties('Gender') }} as gendertype,
    *
from {{ source('mentalhealth_staging', 'mental_health_table_1') }}
The model above extracts data from the BigQuery dataset mentalhealth_staging and applies the macro defined earlier.
Below is the schema for the model:
version: 2

sources:
  - name: mentalhealth_staging
    # database:
    schema: mentalhealth_staging
    tables:
      - name: mental_health_table_1

models:
  - name: extract_mentalhealthdata
    description: "extract mental health data and load it to a staging table"
    columns:
      - name: unique_id
        description: random unique id for every record
        tests:
          - unique
          - not_null
      - name: Timestamp
        description: Time the survey was submitted
      - name: Age
        description: Respondent age
      - name: Gender
        description: Respondent gender
      - name: Country
        description: Respondent country
      - name: state
        description: If you live in the United States, which state or territory do you live in?
      - name: self_employed
        description: Are you self-employed?
      - name: family_history
        description: Do you have a family history of mental illness?
      - name: treatment
        description: Have you sought treatment for a mental health condition?
      - name: work_interfere
        description: If you have a mental health condition, do you feel that it interferes with your work?
      - name: no_employees
        description: How many employees does your company or organization have?
      - name: remote_work
        description: Do you work remotely (outside of an office) at least 50% of the time?
      - name: tech_company
        description: Is your employer primarily a tech company/organization?
      - name: benefits
        description: Does your employer provide mental health benefits?
      - name: care_options
        description: Do you know the options for mental health care your employer provides?
      - name: wellness_program
        description: Has your employer ever discussed mental health as part of an employee wellness program?
      - name: seek_help
        description: Does your employer provide resources to learn more about mental health issues and how to seek help?
      - name: anonymity
        description: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
      - name: leave
        description: How easy is it for you to take medical leave for a mental health condition?
      - name: mental_health_consequence
        description: Do you think that discussing a mental health issue with your employer would have negative consequences?
      - name: phys_health_consequence
        description: Do you think that discussing a physical health issue with your employer would have negative consequences?
      - name: coworkers
        description: Would you be willing to discuss a mental health issue with your coworkers?
      - name: supervisors
        description: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
      - name: mental_health_interview
        description: Would you bring up a mental health issue with a potential employer in an interview?
      - name: phys_health_interview
        description: Would you bring up a physical health issue with a potential employer in an interview?
      - name: mental_vs_physical
        description: Do you feel that your employer takes mental health as seriously as physical health?
      - name: obs_consequence
        description: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
      - name: comments
        description: Any additional notes or comments
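With tests declared in the schema above, you can also run them on their own at any point. A quick sketch of the command, using the model name defined above:

# run only the tests attached to the staging model
dbt test --select extract_mentalhealthdata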
2. Core models
The core models load data from the staging tables and persist it to production tables in the mentalhealth_prod dataset.
employee_dim.sql
{{ config(
materialized='table'
) }}
select
    -- identifier
    unique_id,
    -- employee details
    Age,
    gendertype,
    Country,
    state,
    remote_work,
    tech_company
-- use ref() rather than source() so dbt knows this model depends on the staging model
from {{ ref('extract_mentalhealthdata') }}
Because extract_mentalhealthdata is itself a dbt model, the core model references it with ref() rather than source(); this lets dbt infer the dependency graph and build the models in the correct order. Below is the schema defined for the core models:
version: 2

models:
  - name: employee_dim
    description: "create employee dim table"
    columns:
      - name: unique_id
        description: unique id for every record
        tests:
          - unique
          - not_null
      - name: Age
      - name: gendertype
      - name: Country
      - name: state
      - name: remote_work
      - name: tech_company
Ensure you configure the schemas and data sources correctly.
Now run the staging model using dbt build --select extract_mentalhealthdata
This will create a table with the transformed data based on the model. It will also run the tests defined in schema.yml.
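Because employee_dim references the staging model with ref(), you can also build a model together with everything upstream of it using dbt's graph operators. A minimal sketch:

# the + prefix selects employee_dim plus all of its upstream dependencies
dbt build --select +employee_dim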
Step 3: Creating a deployment environment and running jobs
In dbt Cloud, you can create environments such as "Production" to manage the production stage of your data transformation process. Configure your target dataset and other settings for each environment.
Before deploying your transformations, it's wise to test them in the development environment. Trigger the job and monitor the logs to identify any issues. If everything runs smoothly, you're ready to move to the deployment phase.
- Create a deployment environment in dbt Cloud and define the target dataset, in this case mentalhealth_prod. Note that this deployment environment runs against the production branch created earlier.
- Next, create jobs to run your models in the production environment. Here you can define parameters such as the run schedule, the commands to run, whether to generate docs, and so on; a sample set of job commands is sketched below.
Setting other parameters for the job runs
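What a job actually runs is just a list of dbt commands. A minimal sketch of the commands field for a production job; the core selector is the models folder used in this project, so adjust it to your own layout:

dbt build --select core    # build and test the core models
dbt docs generate          # refresh the project documentation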
Step 4: Monitoring and Maintenance
dbt Cloud provides a dashboard to monitor the status of your jobs. Keep an eye on the execution logs and any potential errors.
Regularly update your transformations as your data and business requirements evolve.
Conclusion
Running BigQuery transformations using dbt Cloud streamlines the process of turning raw data into actionable insights. With a clear step-by-step approach, you can easily set up, develop, test, and deploy your transformations to production. This ensures that your organization benefits from accurate and timely data-driven decisions. Harness the power of dbt Cloud to elevate your data transformation capabilities and propel your business forward.
In just a few minutes, you've learned how to leverage dbt Cloud's capabilities to transform data in BigQuery, all while simplifying complex processes and increasing efficiency. So, what are you waiting for? It's time to unlock the potential of your data!