Introduction
- This lecture covers data warehouses, with a focus on BigQuery.
- Topics include OLAP vs. OLTP, data warehouse architecture, BigQuery features, cost considerations, partitioning, clustering, and query optimization.
1. OLAP vs. OLTP
Definition and Use Cases
-
OLTP (Online Transaction Processing)
- Used in backend services and transactional databases.
- Supports quick, small updates.
- Data is highly normalized for efficiency.
- Example: Online shopping transactions.
-
OLAP (Online Analytical Processing)
- Used for analytical queries and reporting.
- Stores large datasets, updated periodically.
- Data is denormalized to enhance query performance.
- Example: Business intelligence reports for executives.
2. What is a Data Warehouse?
- A centralized repository used for reporting and analysis.
- Consists of:
- Raw data
- Metadata
- Summarized data
-
Architecture:
- Data is sourced from OLTP systems, flat files, operational databases, etc.
- A staging area processes data before it is stored in the warehouse.
- Data can be accessed through data marts tailored for specific users (e.g., sales, inventory).
- Analysts may access data marts, whereas data scientists may work directly with raw data.
3. BigQuery Overview
What is BigQuery?
- A fully managed, serverless data warehouse by Google Cloud.
- No need for infrastructure management.
- Scalable from GBs to PBs of data.
- Supports SQL-based machine learning (BigQuery ML), geospatial data, and business intelligence (BI).
Storage Model
-
Decouples compute and storage:
- Compute engine analyzes data stored externally.
- This reduces cost and increases scalability.
4. BigQuery Cost Model
Pricing Models
-
On-Demand Pricing
- Pay per data processed ($5 per TB).
- Ideal for smaller or variable workloads.
-
Flat-Rate Pricing
- Pre-purchased slots (100 slots ≈ $2,000/month).
- Recommended for workloads >200 TB/month.
Cost Optimization Strategies
- Avoid unnecessary queries.
- Use partitioning and clustering.
- Leverage cached query results.
5. BigQuery External Tables
- Query data stored in Google Cloud Storage (GCS) without importing.
- Schema can be automatically inferred.
- Limitation: BigQuery cannot estimate table size or row count before querying.
6. Partitioning in BigQuery
Why Partition Data?
- Improves query performance and reduces cost by scanning only relevant data.
- Common partitioning strategies:
- Date-based (e.g., timestamp columns)
- Integer-based (e.g., user IDs)
Example: Query Optimization with Partitioning
- Querying non-partitioned data (June 2019) processes 1.6 GB.
- Querying partitioned data (same period) processes only 106 MB.
7. Clustering in BigQuery
What is Clustering?
- Organizes data within partitions for further optimization.
- Common clustering keys: Vendor ID, user ID, country, category, etc.
Example: Partition vs. Partition + Clustering
- Partition-only Query (June 2019 – Dec 2020) processed 1.1 GB.
- Partition + Cluster Query (same period) processed 843 MB.
- Result: Faster queries and lower costs.
8. Query Optimization in BigQuery
Best Practices
- Use partitioning and clustering.
- Filter data before aggregation.
- *Avoid SELECT ** when unnecessary.
- Use BI tools (e.g., Data Studio) for visualization.
Performance Monitoring
- Check query execution plan.
- Monitor processed data size.
- Use BigQuery's information schema to analyze query behavior.
9. Summary
- BigQuery is a powerful, serverless data warehouse.
- Partitioning and clustering help reduce costs and improve performance.
- On-demand pricing is flexible, while flat-rate pricing suits high-volume users.
- External tables allow querying data from GCS without importing.
- Optimizing queries reduces costs and enhances speed.
Further Reading & Resources
- Google Cloud BigQuery Documentation: https://cloud.google.com/bigquery/docs
- BigQuery Pricing: https://cloud.google.com/bigquery/pricing
- BigQuery SQL Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
Top comments (0)