DEV Community

Cover image for Study Note DE Zoomcamp 3.1.1 - Data Warehouse and BigQuery
Pizofreude
Pizofreude

Posted on

Study Note DE Zoomcamp 3.1.1 - Data Warehouse and BigQuery

Introduction

  • This lecture covers data warehouses, with a focus on BigQuery.
  • Topics include OLAP vs. OLTP, data warehouse architecture, BigQuery features, cost considerations, partitioning, clustering, and query optimization.

1. OLAP vs. OLTP

Definition and Use Cases

  • OLTP (Online Transaction Processing)
    • Used in backend services and transactional databases.
    • Supports quick, small updates.
    • Data is highly normalized for efficiency.
    • Example: Online shopping transactions.
  • OLAP (Online Analytical Processing)
    • Used for analytical queries and reporting.
    • Stores large datasets, updated periodically.
    • Data is denormalized to enhance query performance.
    • Example: Business intelligence reports for executives.

OLTP vs OLAP 1

OLTP vs OLAP 2

2. What is a Data Warehouse?

  • A centralized repository used for reporting and analysis.
  • Consists of:
    • Raw data
    • Metadata
    • Summarized data
  • Architecture:
    • Data is sourced from OLTP systems, flat files, operational databases, etc.
    • A staging area processes data before it is stored in the warehouse.
    • Data can be accessed through data marts tailored for specific users (e.g., sales, inventory).

Data warehouse architecture

  • Analysts may access data marts, whereas data scientists may work directly with raw data.

3. BigQuery Overview

What is BigQuery?

  • A fully managed, serverless data warehouse by Google Cloud.
  • No need for infrastructure management.
  • Scalable from GBs to PBs of data.
  • Supports SQL-based machine learning (BigQuery ML), geospatial data, and business intelligence (BI).

Storage Model

  • Decouples compute and storage:
    • Compute engine analyzes data stored externally.
    • This reduces cost and increases scalability.

4. BigQuery Cost Model

Pricing Models

  1. On-Demand Pricing
    • Pay per data processed ($5 per TB).
    • Ideal for smaller or variable workloads.
  2. Flat-Rate Pricing
    • Pre-purchased slots (100 slots ≈ $2,000/month).
    • Recommended for workloads >200 TB/month.

Cost Optimization Strategies

  • Avoid unnecessary queries.
  • Use partitioning and clustering.
  • Leverage cached query results.

5. BigQuery External Tables

  • Query data stored in Google Cloud Storage (GCS) without importing.
  • Schema can be automatically inferred.
  • Limitation: BigQuery cannot estimate table size or row count before querying.

6. Partitioning in BigQuery

Why Partition Data?

  • Improves query performance and reduces cost by scanning only relevant data.
  • Common partitioning strategies:
    • Date-based (e.g., timestamp columns)
    • Integer-based (e.g., user IDs)

Example: Query Optimization with Partitioning

  • Querying non-partitioned data (June 2019) processes 1.6 GB.
  • Querying partitioned data (same period) processes only 106 MB.

7. Clustering in BigQuery

What is Clustering?

  • Organizes data within partitions for further optimization.
  • Common clustering keys: Vendor ID, user ID, country, category, etc.

Example: Partition vs. Partition + Clustering

  • Partition-only Query (June 2019 – Dec 2020) processed 1.1 GB.
  • Partition + Cluster Query (same period) processed 843 MB.
  • Result: Faster queries and lower costs.

8. Query Optimization in BigQuery

Best Practices

  • Use partitioning and clustering.
  • Filter data before aggregation.
  • *Avoid SELECT ** when unnecessary.
  • Use BI tools (e.g., Data Studio) for visualization.

Performance Monitoring

  • Check query execution plan.
  • Monitor processed data size.
  • Use BigQuery's information schema to analyze query behavior.

9. Summary

  • BigQuery is a powerful, serverless data warehouse.
  • Partitioning and clustering help reduce costs and improve performance.
  • On-demand pricing is flexible, while flat-rate pricing suits high-volume users.
  • External tables allow querying data from GCS without importing.
  • Optimizing queries reduces costs and enhances speed.

Further Reading & Resources

Top comments (0)