Pizofreude

Study Notes 3.2.2: BigQuery Internal Architecture

Core Components

1. Storage (Colossus)

  • Uses columnar storage format
  • Separated from compute resources
  • Highly cost-effective for data storage
  • Cost optimization: storage is billed separately from compute, so data at rest incurs only storage charges

2. Network Infrastructure (Jupiter)

  • High-speed internal network within Google data centers
  • Bandwidth: on the order of 1 TB/s (Google cites roughly 1 petabit per second of total bisection bandwidth)
  • Enables efficient communication between separated compute and storage
  • Critical for maintaining low query latency

3. Query Engine (Dremel)

  • Handles query execution and processing
  • Uses tree-based architecture for query distribution
  • Breaks down complex queries into smaller subqueries
  • Components:
    • Root server: Initial query reception and planning
    • Mixers: Query subdivision and result aggregation
    • Leaf nodes: Direct data access and basic operations

[Figure: BigQuery internal architecture]

Storage Architecture

Column-Oriented vs Record-Oriented Storage

  1. Record-Oriented (Traditional)
    • Similar to CSV structure
    • Data stored row by row
    • Better for full record retrieval
  2. Column-Oriented (BigQuery's Approach)
    • Data stored by columns
    • Advantages:
      • Faster column-based aggregations
      • Efficient for queries that access only a subset of columns
      • Better compression, since values within a column share a type and often repeat
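To make the difference concrete, here is a minimal Python sketch of the two layouts. The table contents and the function names are made up for illustration, and this is not how BigQuery stores data internally; it only shows why an aggregation over one column touches fewer values in a column-oriented layout.

```python
# Toy table: three records with three fields each (hypothetical data).
rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "DE", "amount": 7},
]

# Record-oriented layout: one tuple per record, fields stored together.
row_store = [(r["user"], r["country"], r["amount"]) for r in rows]

# Column-oriented layout: one list per column, values of a column stored together.
column_store = {
    "user": [r["user"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount": [r["amount"] for r in rows],
}


def sum_amount_row_store(store):
    """Sum 'amount' from the row layout; every field of every record is read."""
    total, values_read = 0, 0
    for record in store:
        values_read += len(record)  # the whole record is loaded
        total += record[2]          # 'amount' is the third field
    return total, values_read


def sum_amount_column_store(store):
    """Sum 'amount' from the column layout; only the 'amount' column is read."""
    column = store["amount"]
    return sum(column), len(column)


print(sum_amount_row_store(row_store))        # (42, 9)
print(sum_amount_column_store(column_store))  # (42, 3)
```

Both functions return the same total, but the row layout reads nine values while the column layout reads three. On wide tables with many columns the gap grows accordingly.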

[Figure: GCS data storage architecture]

Query Processing Workflow

  1. Query Submission
    • Root server receives query
    • Initial query analysis and planning
  2. Query Distribution
    • Root server breaks the query down into subqueries
    • Mixers further divide into smaller operations
    • Leaf nodes receive specific tasks
  3. Data Processing
    • Leaf nodes communicate with Colossus
    • Execute assigned operations
    • Return partial results to mixers
  4. Result Aggregation
    • Mixers combine results from leaf nodes
    • Root server performs final aggregation
    • Returns complete result set
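The four steps above can be sketched as a small tree aggregation in Python. This is a simplified illustration, not Dremel's actual implementation: the shard data, the `fanout` parameter, and the function names `leaf`, `mixer`, and `root` are all assumptions chosen to mirror the roles described here. It computes an average by passing partial (sum, count) pairs up the tree.

```python
# Hypothetical shards standing in for column data that leaf nodes read from storage.
SHARDS = [[3, 1, 4], [1, 5], [9, 2, 6], [5, 3]]


def leaf(shard):
    """Leaf node: scan one shard and return a partial (sum, count)."""
    return sum(shard), len(shard)


def combine(partials):
    """Merge partial (sum, count) pairs into a single pair."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


def mixer(shards):
    """Mixer: hand each shard to a leaf, then merge the partial results."""
    return combine([leaf(s) for s in shards])


def root(shards, fanout=2):
    """Root: split the work among mixers, then perform the final aggregation."""
    groups = [shards[i:i + fanout] for i in range(0, len(shards), fanout)]
    total, count = combine([mixer(g) for g in groups])
    return total / count  # final AVG over all shards


print(root(SHARDS))  # 3.9
```

Passing (sum, count) pairs rather than per-shard averages matters: averages of unequal-sized shards cannot be averaged again without losing correctness, while sums and counts can be merged at every level.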

[Figure: Query processing workflow]

Key Benefits

  1. Performance
    • Distributed query processing
    • High-speed network infrastructure
    • Efficient columnar storage
  2. Cost Efficiency
    • Separated storage and compute
    • Pay primarily for query processing
    • Economical data storage
  3. Scalability
    • Distributed architecture
    • Efficient handling of large datasets
    • Automatic resource management

Best Practices Note

While understanding internals isn't mandatory for basic usage, it can be valuable for:

  • Building optimized data products
  • Making informed architectural decisions
  • Understanding performance characteristics
  • Implementing cost-effective solutions

This architecture enables BigQuery to handle massive datasets efficiently while maintaining quick query response times through its distributed processing approach.
