# AWS Redshift Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)

## Core Concepts and Building Blocks
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service designed for high-performance analytics and business intelligence workloads.
**Key Components:**

- **Clusters:** Collections of computing resources called nodes
- **Nodes:** Individual compute units that process queries
- **Leader Node:** Manages client connections and query planning
- **Compute Nodes:** Execute queries and store data
- **Slices:** Partitions of compute nodes where data is stored
- **Databases:** Collections of tables
- **Workload Management (WLM):** Controls query prioritization and resource allocation
- **Redshift Spectrum:** Query data directly from S3 without loading
- **Concurrency Scaling:** Automatically adds cluster capacity to handle increased demand
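As a quick sketch of how Redshift Spectrum queries S3 without loading data, an external schema and table can be defined as below (the schema, database, role ARN, and bucket names are illustrative placeholders, not values from this guide):

```sql
-- Register an external schema backed by the AWS Glue Data Catalog
-- (names and the IAM role ARN are illustrative)
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3; no data is loaded
CREATE EXTERNAL TABLE spectrum_demo.sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
```

Queries against `spectrum_demo.sales` then scan the Parquet files in place, billed per byte scanned rather than consuming cluster storage.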
## Redshift Architecture Mind Map

```
Amazon Redshift
├── Cluster Architecture
│   ├── Leader Node
│   │   ├── Query Planning
│   │   ├── Result Aggregation
│   │   └── Client Connection Management
│   └── Compute Nodes
│       ├── Data Storage
│       ├── Query Execution
│       └── Slices (Data Partitions)
├── Data Storage
│   ├── Columnar Storage
│   ├── Zone Maps
│   ├── Data Compression
│   └── Data Distribution
│       ├── Even Distribution
│       ├── Key Distribution
│       └── All Distribution
├── Query Processing
│   ├── MPP Architecture
│   ├── Query Optimization
│   └── Result Caching
└── Management Features
    ├── Workload Management (WLM)
    ├── Concurrency Scaling
    ├── AQUA (Advanced Query Accelerator)
    ├── Redshift Spectrum
    └── Automatic Table Optimization
```
## Node Types and Specifications

| Node Type | vCPU | Memory | Storage | I/O | Use Case |
|---|---|---|---|---|---|
| **RA3 Nodes** | | | | | |
| ra3.16xlarge | 48 | 384 GB | Managed | 4x | Large data warehouses |
| ra3.4xlarge | 12 | 96 GB | Managed | 2x | Medium data warehouses |
| ra3.xlplus | 4 | 32 GB | Managed | 1.5x | Small data warehouses |
| **DC2 Nodes** | | | | | |
| dc2.8xlarge | 32 | 244 GB | 2.56 TB SSD | High | Compute-intensive workloads |
| dc2.large | 2 | 15 GB | 160 GB SSD | Moderate | Small data warehouses |
| **Serverless** | | | | | |
| Serverless | Auto-scaling | Auto-scaling | Managed | Varies | Unpredictable workloads |
## Redshift Features and Details

| Feature | Description | Limits/Notes |
|---|---|---|
| Maximum Cluster Size | Up to 128 nodes (RA3) | Single-node clusters available for dev/test |
| Maximum Storage | Up to 8 PB with RA3 nodes using Redshift Managed Storage | Scales automatically |
| Maximum Table Size | Unlimited with Spectrum | Limited by cluster storage for local tables |
| Maximum Row Size | 4 MB | - |
| Maximum Columns per Table | 1,600 | - |
| Maximum Tables per Cluster | Up to 200,000 | Includes temporary tables |
| Maximum Concurrent Queries | 50 by default | Can be increased with concurrency scaling |
| Maximum Concurrent Connections | 500 | - |
| Maximum Database Size | Unlimited with Spectrum | Limited by cluster storage for local data |
| Backup Retention | 1-35 days for automated snapshots | Manual snapshots don't expire |
| Query Timeout | Default 1 hour | Configurable |
| Maintenance Window | 30-minute window once per week | Configurable |
| Encryption | At-rest and in-transit | KMS or HSM |
| VPC Support | Yes | Enhanced VPC routing available |
| Cross-Region Snapshots | Yes | For disaster recovery |
| Elastic Resize | Add/remove nodes in minutes | Limited to same node type |
| Classic Resize | Change node types | Takes longer than elastic resize |
| Concurrency Scaling | Auto-scales read capacity | Charged by usage (1 free hour per day) |
| AQUA | Hardware-accelerated cache | Available for RA3 nodes |
| Federated Query | Query across databases | PostgreSQL and Aurora PostgreSQL |
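To illustrate the Federated Query row above, an external schema can point at a live Aurora PostgreSQL database so operational data can be joined with warehouse tables. Every name below (endpoint, database, role, secret ARN, and the `local_orders` table) is a hypothetical placeholder:

```sql
-- Federated query: expose a live Aurora PostgreSQL schema inside Redshift
-- (endpoint, credentials, and table names are illustrative)
CREATE EXTERNAL SCHEMA pg_federated
FROM POSTGRES
DATABASE 'appdb' SCHEMA 'public'
URI 'my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com'
PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/MyFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';

-- Join warehouse data with live operational data in one query
SELECT o.order_id, c.email
FROM local_orders o
JOIN pg_federated.customers c ON c.id = o.customer_id;
```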
## Data Distribution Styles

| Distribution Style | Description | Best For | Performance Impact |
|---|---|---|---|
| AUTO | Redshift assigns optimal distribution | General use | Good for most cases |
| EVEN | Rows distributed evenly across slices | Tables without a clear join key | Balanced storage; potential data movement during joins |
| KEY | Rows with the same distribution-column value stored on the same slice | Tables joined on the distribution key | Minimizes data movement during joins |
| ALL | Full copy of table on every node | Small dimension tables | Fast joins but storage overhead |
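A minimal sketch of KEY and ALL distribution in DDL (table and column names are made up for illustration):

```sql
-- KEY distribution: co-locate rows that join on customer_id,
-- so joins on that column avoid cross-node data movement
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_total DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL distribution: replicate a small dimension table to every node
CREATE TABLE region_dim (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;
```

Omitting `DISTSTYLE` entirely leaves the table on AUTO, letting Redshift pick and adjust the style as the table grows.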
## Sort Keys

| Sort Key Type | Description | Best For | Performance Impact |
|---|---|---|---|
| Compound | Sort by column order (like a phone book) | Range-restricted scans on sort columns | Excellent for queries filtering on a prefix of the sort key |
| Interleaved | Equal weight to each sort column | Queries with predicates on different columns | Better for varied query patterns |
| Automatic | Redshift chooses optimal sort key | General use | Good for most cases |
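The two explicit sort key styles look like this in DDL (illustrative table and column names):

```sql
-- Compound sort key: best when queries filter on a prefix
-- of the key, e.g. WHERE event_date BETWEEN ... AND ...
CREATE TABLE events_compound (
    event_date DATE,
    user_id    BIGINT,
    event_type VARCHAR(32)
)
COMPOUND SORTKEY (event_date, user_id);

-- Interleaved sort key: gives each column equal weight,
-- for workloads that filter on event_date OR user_id alone
CREATE TABLE events_interleaved (
    event_date DATE,
    user_id    BIGINT,
    event_type VARCHAR(32)
)
INTERLEAVED SORTKEY (event_date, user_id);
```

Note that interleaved sort keys require periodic `VACUUM REINDEX` to stay effective, which is part of why compound (or automatic) keys are the usual default.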
## Compression Encodings

| Encoding | Best For | Compression Ratio | Performance Impact |
|---|---|---|---|
| RAW | Binary data, already compressed | None | Baseline |
| AZ64 | Numeric data | Good | Fast computation |
| BYTEDICT | Limited distinct values | Very high | Fast for small domains |
| DELTA | Incremental numeric data | High | Good for dates, timestamps |
| LZO | Very large text columns | Moderate | Good general purpose |
| ZSTD | Varied data types | High | Good general purpose, better than LZO |
| RUNLENGTH | Repeated values | Very high | Excellent for low-cardinality columns |
| TEXT255/TEXT32K | Variable-length strings | High | Good for text |
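Encodings are set per column in the DDL, and Redshift can also recommend them from existing data. A hedged sketch (the table and columns are invented for illustration):

```sql
-- Column-level compression encodings chosen to match the data shape
CREATE TABLE page_views (
    view_id    BIGINT       ENCODE az64,      -- numeric data
    view_date  DATE         ENCODE az64,      -- AZ64 also suits dates
    country    VARCHAR(2)   ENCODE bytedict,  -- few distinct values
    user_agent VARCHAR(256) ENCODE zstd       -- varied text
);

-- Ask Redshift to recommend encodings from a sample of loaded data
ANALYZE COMPRESSION page_views;
```

Columns with no explicit `ENCODE` get a default encoding, and `COPY` with `COMPUPDATE ON` can apply recommendations automatically on the first load into an empty table.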
## Important Redshift Limits and Performance Factors

- Maximum of 500 concurrent connections per cluster
- Default query timeout is 1 hour (configurable)
- Maximum of 50 concurrent queries by default
- Maximum of 100 databases per cluster
- Maximum of 9,900 schemas per database
- Maximum of 200,000 tables per cluster (including temporary tables)
- Maximum row size is 4 MB
- Maximum column name length is 127 bytes
- Maximum of 1,600 columns per table
- Maximum identifier length is 127 bytes
- Maximum SQL statement size is 16 MB
## Query Performance Optimization

- Use the COPY command for bulk data loading (8-10x faster than INSERT)
- Choose appropriate distribution keys to minimize data movement
- Use sort keys for columns frequently used in WHERE clauses
- Vacuum regularly to reclaim space and re-sort data
- Analyze tables to update statistics for the query planner
- Use appropriate compression encodings for columns
- Avoid SELECT * and retrieve only needed columns
- Use UNLOAD to export large result sets to S3
- Implement proper partitioning when using Redshift Spectrum
- Use materialized views for common, complex queries
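The maintenance and materialized-view points above can be sketched as follows, assuming a hypothetical `sales` table:

```sql
-- Reclaim deleted space and re-sort rows, then refresh
-- the planner statistics the optimizer relies on
VACUUM FULL sales;
ANALYZE sales;

-- Precompute a common aggregation as a materialized view
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;

-- Refresh after new data loads so readers see current results
REFRESH MATERIALIZED VIEW daily_sales;
```

In practice Redshift's auto-vacuum and auto-analyze cover much of this, but explicit runs are still useful after very large deletes or loads.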
## Data Loading Best Practices

- Use the COPY command from S3, not INSERT statements
- Split large files into multiple files (1-128 MB each)
- Use gzip compression for load files
- Load data in parallel using multiple files
- Use a manifest file to ensure all files are loaded
- Use STATUPDATE ON to update statistics after loading
- Use COMPUPDATE ON for automatic compression analysis
- Temporarily disable automatic compression for very large loads
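A COPY command combining several of the practices above might look like this (bucket, manifest path, target table, and IAM role are illustrative):

```sql
-- Parallel, manifest-driven load of gzipped CSV files from S3
COPY sales
FROM 's3://my-bucket/loads/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyCopyRole'
MANIFEST                -- load exactly the files listed in the manifest
GZIP                    -- input files are gzip-compressed
FORMAT AS CSV
STATUPDATE ON           -- refresh planner statistics after the load
COMPUPDATE ON;          -- analyze and apply compression on first load
```

Because COPY splits the file list across slices, keeping the number of files a multiple of the slice count maximizes parallelism.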