Welcome to a hands-on guide for implementing BigQuery Policy Tags, an important feature for data governance. If you're new to Google Cloud Platform (GCP) and have heard of dbt, Terraform, and Data Catalog but aren't sure how they work together, this tutorial provides a simple, practical example. We'll apply policy tags to a sample clients table in BigQuery to enforce data governance.
What Are Policy Tags?
Policy Tags are classification labels for data in BigQuery, helping manage privacy, compliance, and access control. These tags are particularly important in industries like healthcare and finance, where data sensitivity is a key concern.
Why Use dbt, Terraform, and Dataplex for Policy Tag Management?
By defining Policy Tags as code with dbt and Terraform, and using Dataplex for governance, you can track changes, collaborate as a team, audit activity, and easily roll back to previous configurations.
Additionally, you can consistently manage Policy Tags across multiple datasets and projects, reducing manual labor. Automation through these tools minimizes human error and ensures policy tags are consistently applied.
While dbt integrates seamlessly into existing data pipelines, applying Policy Tags during data transformation and modeling, Dataplex unifies governance across various data stores.
Understanding the Tools
Let's take a closer look at dbt, Terraform, and Dataplex.
• dbt (Data Build Tool) is an open-source tool for transforming and modeling data within your data warehouse (such as BigQuery). dbt enables you to write more maintainable SQL code and lets you attach metadata, such as Policy Tags, to your data transformations. Additionally, because objects in BigQuery can reference one another, dbt can build a Directed Acyclic Graph (DAG) of the entire data platform, which lets you observe all dependencies among the data.
• Terraform is an Infrastructure as Code (IaC) tool that lets you define and manage your cloud infrastructure using configuration files. Terraform automates the provisioning and management of resources including enabling APIs, managing permissions, and creating Policy Taxonomies and Tags.
• Dataplex is a Google Cloud service that provides unified data governance and management. It helps discover, organize, and manage data assets, ensuring consistent data handling and enforcement of Policy Tags.
How They Work Together
OK, now that we know what dbt, Terraform, and Dataplex are, let's explore how they work together.
Terraform establishes the required infrastructure and permissions for managing Policy Tags by creating taxonomies and tags within Google Cloud Data Catalog. dbt, in turn, handles data transformation in BigQuery and applies Policy Tags to specific columns within your models; the column metadata in each model's YAML file ties columns to their tags, ensuring proper organization and governance. Meanwhile, Dataplex functions as a centralized governance layer, maintaining consistency in the application and monitoring of Policy Tags across all data assets. Together, these tools create a seamless, scalable, and automated data governance system that enhances visibility, reduces manual effort, and minimizes the risk of human error.
Implementing Policy Tags: Step-by-Step Guide
First, you need to create a GCP project and connect it to a billing account; otherwise, Terraform won't be able to function. Don't worry: this project is small and won't incur any costs (my billing account is still at zero). Just remember to delete it after testing by running terraform destroy in the command line. I'll remind you at the end of the tutorial.
For convenience, I’ve created a Git project that you can clone. It contains two sub-projects: one for Terraform (gcp-data-catalog-terraform) and one for dbt (data_catalog_dbt_project). In a real-world scenario, these sub-projects would likely be managed as separate projects.
The terraform project has the following structure:
• variables.tf: Defines the GCP project ID and region.
• iam.tf: Creates service accounts for Terraform and dbt, assigning necessary IAM roles.
• datacatalog.tf: Defines the taxonomy for organizing Policy Tags and creates tags for PII and non-PII data.
• bigquery.tf: Creates a BigQuery dataset and a table without predefined policy tags.
• output.tf: Outputs IDs for easy access to created resources.
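For orientation, variables.tf can be as small as the sketch below. The default values are assumptions taken from the naming used elsewhere in this tutorial; substitute your own project ID and region:

```hcl
# variables.tf — project-wide inputs referenced by the other .tf files
variable "project_id" {
  description = "The GCP project ID"
  type        = string
  default     = "gcp-data-governance" # assumption: replace with your project ID
}

variable "region" {
  description = "The GCP region for BigQuery and Data Catalog resources"
  type        = string
  default     = "europe-west6" # assumption: matches the dbt profile's location
}
```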
The dbt project in this example is a significantly simplified version of a standard dbt project and includes only the essential files necessary for our use case. Here’s a brief description of each file and its purpose:
• dbt_project.yml: This is the primary configuration file for your dbt project. It includes metadata about the project, such as the project name and version, paths to model files, and configurations for materializations and other project-wide settings.
• profiles.yml: This configuration file contains the connection details and credentials required for dbt to connect to your data warehouse (BigQuery in this case). It includes information such as the project ID, dataset, and authentication method (service account key file).
• models/customers/: This directory holds the models for the project. In dbt, a model is essentially a SQL file that transforms raw data into more refined tables. This directory contains our specific model for customers.
◦ customers.sql: This SQL file represents the transformation logic for the customers’ data. It selects and processes the necessary columns from the raw data, applying the transformations required for our data analysis needs.
◦ customers.yml: This YAML file provides additional metadata about the customers model. It includes descriptions of each column, tests to ensure data quality, and policy tags to enforce data governance rules.
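To make the structure concrete, a minimal dbt_project.yml for this setup could look like the following sketch. The project name data_catalog_dbt mirrors the repository layout described above; treat the details as assumptions:

```yaml
# dbt_project.yml — minimal project configuration (sketch)
name: "data_catalog_dbt"
version: "1.0.0"
profile: "data_catalog_dbt" # must match the profile name in profiles.yml

model-paths: ["models"]

models:
  data_catalog_dbt:
    customers:
      +materialized: table # materialize the customers model as a table
```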
Step 1: Enable APIs with Terraform
First, enable the necessary Google Cloud APIs for your project using Terraform. These are:
• Identity and Access Management (IAM) API
• BigQuery API
• Data Catalog API
• Dataplex API
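A sketch of how these APIs can be enabled in Terraform with the google_project_service resource (the resource name required_apis is illustrative):

```hcl
# Enable the required Google Cloud APIs for the project
resource "google_project_service" "required_apis" {
  for_each = toset([
    "iam.googleapis.com",         # Identity and Access Management (IAM) API
    "bigquery.googleapis.com",    # BigQuery API
    "datacatalog.googleapis.com", # Data Catalog API
    "dataplex.googleapis.com",    # Dataplex API
  ])

  project = var.project_id
  service = each.key

  # Keep the APIs enabled even if this resource is later removed from state
  disable_on_destroy = false
}
```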
Step 2: Grant Permissions with Terraform
Let us walk through how to create and configure a service account that Terraform and dbt can use to interact with Google Cloud resources.
A service account in Google Cloud is like a robot user—it allows Terraform and dbt to authenticate and interact with GCP without needing a human user to log in.
In our setup, Terraform will provision BigQuery datasets, tables, and policies, while dbt will query and transform the data. To ensure that the permissions for Terraform and dbt are properly separated, we will define two distinct service accounts.
We can create the new service accounts (called terraform-sa and dbt-sa) by running:
gcloud iam service-accounts create terraform-sa --display-name "Terraform Service Account" --project gcp-data-governance
gcloud iam service-accounts create dbt-sa --display-name "dbt Service Account" --project gcp-data-governance
Alternatively, you can create the service accounts manually in the GCP console (IAM & Admin > Service Accounts). Then, in Terraform, you only manage the IAM roles in iam.tf.
The Terraform service account needs permissions to manage BigQuery, Data Catalog, and Dataplex. Add these IAM roles in your Terraform configuration:
# Give the Terraform service account permissions to manage IAM, BigQuery, Data Catalog, and Dataplex
resource "google_project_iam_member" "terraform_bigquery_admin" {
  project = var.project_id
  role    = "roles/bigquery.admin"
  member  = "serviceAccount:terraform-sa@gcp-data-governance.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "terraform_datacatalog_admin" {
  project = var.project_id
  role    = "roles/datacatalog.admin"
  member  = "serviceAccount:terraform-sa@gcp-data-governance.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "terraform_dataplex_admin" {
  project = var.project_id
  role    = "roles/dataplex.admin"
  member  = "serviceAccount:terraform-sa@gcp-data-governance.iam.gserviceaccount.com"
}

resource "google_project_iam_member" "terraform_sa_admin" {
  project = var.project_id
  role    = "roles/iam.serviceAccountAdmin"
  member  = "serviceAccount:terraform-sa@gcp-data-governance.iam.gserviceaccount.com"
}
Note: For more information on the available roles, refer to the documentation. As a best practice, always assign the minimum permissions necessary. For instance, if a user only needs to view the data, provide read-only access.
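The dbt service account needs far narrower permissions than the Terraform one: enough to create tables and run queries, plus fine-grained reader access so it can read columns protected by policy tags. A sketch, assuming the same project and account naming as above:

```hcl
# Allow dbt to create and modify tables in BigQuery
resource "google_project_iam_member" "dbt_bigquery_data_editor" {
  project = var.project_id
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:dbt-sa@gcp-data-governance.iam.gserviceaccount.com"
}

# Allow dbt to run BigQuery jobs (queries, loads)
resource "google_project_iam_member" "dbt_bigquery_job_user" {
  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:dbt-sa@gcp-data-governance.iam.gserviceaccount.com"
}

# Allow dbt to read columns protected by policy tags
resource "google_project_iam_member" "dbt_fine_grained_reader" {
  project = var.project_id
  role    = "roles/datacatalog.categoryFineGrainedReader"
  member  = "serviceAccount:dbt-sa@gcp-data-governance.iam.gserviceaccount.com"
}
```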
To allow dbt to authenticate, we need to generate a JSON key file for our service account, which we can do in two ways:
Option 1: Using the Google Cloud Console
- Go to IAM & Admin → Service Accounts in the Google Cloud Console.
- Select the service account dbt-sa.
- Navigate to the Keys tab and click "Add Key".
- Choose "JSON", then click "Create".
- A JSON key file will be downloaded to your computer; rename it to dbt-sa-key.json so it matches the steps below.
Option 2: Using the gcloud CLI
If you prefer using the command line, run:
gcloud iam service-accounts keys create dbt-sa-key.json --iam-account=dbt-sa@your-project-id.iam.gserviceaccount.com
Now that we have our dbt-sa-key.json, we need to update dbt's configuration to use the service account. Open profiles.yml and update the value of keyfile:
data_catalog_dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: "gcp-data-governance" # project_id
      dataset: multiplexer_dataset
      threads: 4
      keyfile: "/path/to/dbt-sa-key.json" # Path to your service account key file
      location: "europe-west6" # Set this to your BigQuery region
Step 3: Create Data Policy Taxonomies and Tags with Terraform
To allow Terraform to interact with Google Data Catalog, we need to specify providers. Providers allow us to manage resources on a specific cloud platform. We have defined them in datacatalog.tf because they are relevant to the Data Catalog API; however, you can create a separate providers.tf file and define them there. The google provider is the standard Terraform provider for managing GCP resources, while the google-beta provider gives access to Google Data Catalog features, which are needed for policy tags.
Why Not Just Use google-beta?
Some resources (e.g., IAM, BigQuery datasets) don’t need google-beta, so we keep google for those. On the other hand, Data Catalog resources require google-beta, so we configure both providers.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0"
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 4.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

provider "google-beta" {
  project = var.project_id
  region  = var.region
}
A taxonomy is a container for policy tags (a grouping mechanism). The Fine-Grained Access Control policy type activated on the taxonomy ensures that only authorized users can see or query columns protected by its tags.
# Create a Data Catalog Taxonomy
resource "google_data_catalog_taxonomy" "multiplexer_pii_taxonomy" {
  provider               = google-beta
  display_name           = "Multiplexer PII Taxonomy"
  description            = "Taxonomy for sensitive data classification"
  project                = var.project_id
  region                 = var.region
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}
In our test case we define only two tags: one for sensitive (PII) data and one for non-sensitive data. The PII tag ensures proper access control. These tags will later be attached to BigQuery columns in dbt, and more tags can be added as needed for finer-grained governance.
# Create Policy Tags for PII and Non-PII
resource "google_data_catalog_policy_tag" "pii_sensitive" {
  provider     = google-beta
  taxonomy     = google_data_catalog_taxonomy.multiplexer_pii_taxonomy.id
  display_name = "PII"
  description  = "Policy tag for Personally Identifiable Information"
}

resource "google_data_catalog_policy_tag" "non_pii_sensitive" {
  provider     = google-beta
  taxonomy     = google_data_catalog_taxonomy.multiplexer_pii_taxonomy.id
  display_name = "Non-PII"
  description  = "Policy tag for non-sensitive data"
}
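With the taxonomy and tags in place, output.tf can expose the generated policy tag IDs, which the dbt project will need later. A sketch (the output names are assumptions):

```hcl
# output.tf — expose resource IDs for use outside Terraform (e.g., in dbt)
output "pii_sensitive_policy_tag_id" {
  value       = google_data_catalog_policy_tag.pii_sensitive.id
  description = "Full resource name of the PII policy tag"
}

output "non_pii_sensitive_policy_tag_id" {
  value       = google_data_catalog_policy_tag.non_pii_sensitive.id
  description = "Full resource name of the Non-PII policy tag"
}
```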
Step 4: Apply Policy Tags with dbt
Now that the Policy Tags are defined, we will attach them to relevant columns in the dbt project:
- Set up your dbt project in a directory parallel to your Terraform project. Before running dbt, ensure that your service account key is correctly set up (see Step 2) and that profiles.yml is configured to use BigQuery with your service account key. Then make sure all dependencies are installed:
dbt deps
Afterwards, check whether dbt can connect to BigQuery by running:
dbt debug
If successful, you’ll see the message:
All checks passed!
To run all dbt models, use dbt run; to run only specific models, e.g., customers, use dbt run -s customers
- Define columns and attach Policy Tags in your dbt models.
Navigate to your dbt project and modify your customers.yml to attach Policy Tags:
version: 2

models:
  - name: customers
    description: "Customers data with PII"
    config:
      persist_docs:
        columns: true # required for dbt to apply column-level policy tags
    columns:
      - name: customer_id
        description: "Unique customer identifier"
        policy_tags:
          - "{{ var('non_pii_sensitive_policy_tag_id') }}"
      - name: email
        description: "Customer email"
        policy_tags:
          - "{{ var('pii_sensitive_policy_tag_id') }}"
      - name: phone_number
        description: "Customer phone number"
        policy_tags:
          - "{{ var('pii_sensitive_policy_tag_id') }}"
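The var() references in customers.yml must resolve to the full policy tag resource names created by Terraform. One way to wire them up, assuming you copy the IDs printed by terraform output, is a vars block in dbt_project.yml (the IDs below are placeholders, not real values):

```yaml
# dbt_project.yml — map the Terraform-created policy tag IDs to dbt vars
vars:
  # Placeholders: replace with the IDs printed by `terraform output`
  pii_sensitive_policy_tag_id: "projects/gcp-data-governance/locations/europe-west6/taxonomies/1234567890/policyTags/1111111111"
  non_pii_sensitive_policy_tag_id: "projects/gcp-data-governance/locations/europe-west6/taxonomies/1234567890/policyTags/2222222222"
```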
Now run dbt to apply the Policy Tags to BigQuery automatically: dbt run -s customers
Conclusion
Integrating dbt, Terraform, and Dataplex allows you to efficiently manage BigQuery Policy Tags, enforcing data governance policies in a scalable and automated way. This approach enhances security, compliance, and operational efficiency while reducing manual effort.
To avoid unnecessary charges, remember to destroy your project after testing by running terraform destroy
in the command line.