Pizofreude

Posted on Feb 11

Study Note DE Zoomcamp 3.1.1 - Data Warehouse and BigQuery

#dataengineering #dezoomcamp #bigquery #datawarehouse

Introduction

This lecture covers data warehouses, with a focus on BigQuery.
Topics include OLAP vs. OLTP, data warehouse architecture, BigQuery features, cost considerations, partitioning, clustering, and query optimization.

1. OLAP vs. OLTP

Definition and Use Cases

OLTP (Online Transaction Processing)
- Used in backend services and transactional databases.
- Supports quick, small updates.
- Data is highly normalized for efficiency.
- Example: Online shopping transactions.
OLAP (Online Analytical Processing)
- Used for analytical queries and reporting.
- Stores large datasets, updated periodically.
- Data is denormalized to enhance query performance.
- Example: Business intelligence reports for executives.

2. What is a Data Warehouse?

A centralized repository used for reporting and analysis.
Consists of:
- Raw data
- Metadata
- Summarized data
Architecture:
- Data is sourced from OLTP systems, flat files, operational databases, etc.
- A staging area processes data before it is stored in the warehouse.
- Data can be accessed through data marts tailored for specific users (e.g., sales, inventory).

Analysts may access data marts, whereas data scientists may work directly with raw data.

3. BigQuery Overview

What is BigQuery?

A fully managed, serverless data warehouse by Google Cloud.
No need for infrastructure management.
Scalable from GBs to PBs of data.
Supports SQL-based machine learning (BigQuery ML), geospatial data, and business intelligence (BI).

Storage Model

Decouples compute and storage:
- Compute engine analyzes data stored externally.
- This reduces cost and increases scalability.

4. BigQuery Cost Model

Pricing Models

On-Demand Pricing
- Pay per data processed ($5 per TB).
- Ideal for smaller or variable workloads.
Flat-Rate Pricing
- Pre-purchased slots (100 slots ≈ $2,000/month).
- Recommended for workloads >200 TB/month.

Cost Optimization Strategies

Avoid unnecessary queries.
Use partitioning and clustering.
Leverage cached query results.

5. BigQuery External Tables

Query data stored in Google Cloud Storage (GCS) without importing.
Schema can be automatically inferred.
Limitation: BigQuery cannot estimate table size or row count before querying.

6. Partitioning in BigQuery

Why Partition Data?

Improves query performance and reduces cost by scanning only relevant data.
Common partitioning strategies:
- Date-based (e.g., timestamp columns)
- Integer-based (e.g., user IDs)

Example: Query Optimization with Partitioning

Querying non-partitioned data (June 2019) processes 1.6 GB.
Querying partitioned data (same period) processes only 106 MB.

7. Clustering in BigQuery

What is Clustering?

Organizes data within partitions for further optimization.
Common clustering keys: Vendor ID, user ID, country, category, etc.

Example: Partition vs. Partition + Clustering

Partition-only Query (June 2019 – Dec 2020) processed 1.1 GB.
Partition + Cluster Query (same period) processed 843 MB.
Result: Faster queries and lower costs.

8. Query Optimization in BigQuery

Best Practices

Use partitioning and clustering.
Filter data before aggregation.
*Avoid SELECT ** when unnecessary.
Use BI tools (e.g., Data Studio) for visualization.

Performance Monitoring

Check query execution plan.
Monitor processed data size.
Use BigQuery's information schema to analyze query behavior.

9. Summary

BigQuery is a powerful, serverless data warehouse.
Partitioning and clustering help reduce costs and improve performance.
On-demand pricing is flexible, while flat-rate pricing suits high-volume users.
External tables allow querying data from GCS without importing.
Optimizing queries reduces costs and enhances speed.

DEV Community

Study Note DE Zoomcamp 3.1.1 - Data Warehouse and BigQuery

Introduction

1. OLAP vs. OLTP

Definition and Use Cases

2. What is a Data Warehouse?

3. BigQuery Overview

What is BigQuery?

Storage Model

4. BigQuery Cost Model

Pricing Models

Cost Optimization Strategies

5. BigQuery External Tables

6. Partitioning in BigQuery

Why Partition Data?

Example: Query Optimization with Partitioning

7. Clustering in BigQuery

What is Clustering?

Example: Partition vs. Partition + Clustering

8. Query Optimization in BigQuery

Best Practices

Performance Monitoring

9. Summary

Further Reading & Resources

Top comments (0)

Read next

Eigen Decomposition of a Matrix

LLM Function-Calling Performance: API- vs User-Aligned

How I Built a Marketplace to Help French Artisans Sell Their Businesses

Understanding Git Workflows Beyond the Basics