Amazon Redshift Cheat Sheet

AWS Redshift Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)

Core Concepts and Building Blocks

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse service designed for high-performance analytics and business intelligence workloads.

Key Components:

  1. Clusters: Collection of computing resources called nodes
  2. Nodes: Individual compute units that process queries
  3. Leader Node: Manages client connections and query planning
  4. Compute Nodes: Execute queries and store data
  5. Slices: Partitions of a compute node's memory and disk where data is stored and processed in parallel
  6. Databases: Collections of tables
  7. Workload Management (WLM): Controls query prioritization and resource allocation
  8. Redshift Spectrum: Query data directly from S3 without loading
  9. Concurrency Scaling: Automatically adds cluster capacity to handle increased demand

Redshift Architecture Mind Map

Amazon Redshift
├── Cluster Architecture
│   ├── Leader Node
│   │   ├── Query Planning
│   │   ├── Result Aggregation
│   │   └── Client Connection Management
│   └── Compute Nodes
│       ├── Data Storage
│       ├── Query Execution
│       └── Slices (Data Partitions)
├── Data Storage
│   ├── Columnar Storage
│   ├── Zone Maps
│   ├── Data Compression
│   └── Data Distribution
│       ├── Even Distribution
│       ├── Key Distribution
│       └── All Distribution
├── Query Processing
│   ├── MPP Architecture
│   ├── Query Optimization
│   └── Result Caching
└── Management Features
    ├── Workload Management (WLM)
    ├── Concurrency Scaling
    ├── AQUA (Advanced Query Accelerator)
    ├── Redshift Spectrum
    └── Automatic Table Optimization

Node Types and Specifications

| Node Type | vCPU | Memory | Storage | I/O | Use Case |
| --- | --- | --- | --- | --- | --- |
| RA3 Nodes | | | | | |
| ra3.16xlarge | 48 | 384 GB | Managed | 4x | Large data warehouses |
| ra3.4xlarge | 12 | 96 GB | Managed | 2x | Medium data warehouses |
| ra3.xlplus | 4 | 32 GB | Managed | 1.5x | Small data warehouses |
| DC2 Nodes | | | | | |
| dc2.8xlarge | 32 | 244 GB | 2.56 TB SSD | High | Compute-intensive workloads |
| dc2.large | 2 | 15 GB | 160 GB SSD | Moderate | Small data warehouses |
| Serverless | Auto-scaling | Auto-scaling | Managed | Varies | Unpredictable workloads |

Redshift Features and Details

| Feature | Description | Limits/Notes |
| --- | --- | --- |
| Maximum Cluster Size | Up to 128 nodes (RA3) | Single-node clusters available for dev/test |
| Maximum Storage | Up to 8 PB with RA3 nodes using Redshift Managed Storage | Scales automatically |
| Maximum Table Size | Unlimited with Spectrum | Limited by cluster storage for local tables |
| Maximum Row Size | 4 MB | - |
| Maximum Columns per Table | 1,600 | - |
| Maximum Tables per Cluster | Thousands | Depends on node type and complexity |
| Maximum Concurrent Queries | 50 by default | Can be increased with concurrency scaling |
| Maximum Concurrent Connections | 500 | - |
| Maximum Database Size | Unlimited with Spectrum | Limited by cluster storage for local data |
| Backup Retention | 1-35 days for automated snapshots | Manual snapshots don't expire |
| Query Timeout | Default 1 hour | Configurable |
| Maintenance Window | 30-minute window once per week | Configurable |
| Encryption | At-rest and in-transit | KMS or HSM |
| VPC Support | Yes | Enhanced VPC routing available |
| Cross-Region Snapshots | Yes | For disaster recovery |
| Elastic Resize | Add/remove nodes in minutes | Limited to same node type |
| Classic Resize | Change node types | Takes longer than elastic resize |
| Concurrency Scaling | Auto-scales read capacity | Charged by usage (1 free hour per day) |
| AQUA | Hardware-accelerated cache | Available for RA3 nodes |
| Federated Query | Query across databases | PostgreSQL and Aurora PostgreSQL |

Data Distribution Styles

| Distribution Style | Description | Best For | Performance Impact |
| --- | --- | --- | --- |
| AUTO | Redshift assigns optimal distribution | General use | Good for most cases |
| EVEN | Rows distributed evenly across slices | Tables without a clear join key | Balanced storage, potential data movement during joins |
| KEY | Rows with the same distribution column value stored on the same slice | Tables joined on the distribution key | Minimizes data movement during joins |
| ALL | Full copy of table on every node | Small dimension tables | Fast joins but storage overhead |
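
A minimal DDL sketch of how these styles are declared (the table and column names here are hypothetical):

CREATE TABLE sales_fact (          -- large fact table
  sale_id     BIGINT,
  customer_id INTEGER,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);             -- co-locate rows that join on customer_id

CREATE TABLE region_dim (          -- small dimension table
  region_id   INTEGER,
  region_name VARCHAR(64)
)
DISTSTYLE ALL;                     -- full copy on every node for fast joins

CREATE TABLE staging_events (
  event_id BIGINT,
  payload  VARCHAR(256)
)
DISTSTYLE AUTO;                    -- let Redshift pick and adjust the style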

Sort Keys

| Sort Key Type | Description | Best For | Performance Impact |
| --- | --- | --- | --- |
| Compound | Sort by column order (like a phone book) | Range-restricted scans on sort columns | Excellent for queries filtering on a prefix of the sort key |
| Interleaved | Equal weight to each sort column | Queries with predicates on different columns | Better for varied query patterns |
| Automatic | Redshift chooses optimal sort key | General use | Good for most cases |
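
Sort keys are declared in the same CREATE TABLE statement. A short sketch using a hypothetical orders table, including the ALTER statements that hand sort and distribution decisions back to automatic optimization:

-- Compound sort key: effective when queries filter on a prefix such as sale_date
CREATE TABLE orders (
  order_id  BIGINT,
  sale_date DATE,
  store_id  INTEGER,
  total     DECIMAL(12,2)
)
COMPOUND SORTKEY (sale_date, store_id);

-- Switch an existing table to automatic table optimization
ALTER TABLE orders ALTER SORTKEY AUTO;
ALTER TABLE orders ALTER DISTSTYLE AUTO;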

Compression Encodings

| Encoding | Best For | Compression Ratio | Performance Impact |
| --- | --- | --- | --- |
| RAW | Binary data, already-compressed data | None | Baseline |
| AZ64 | Numeric data | Good | Fast computation |
| BYTEDICT | Limited distinct values | Very high | Fast for small domains |
| DELTA | Incremental numeric data | High | Good for dates, timestamps |
| LZO | Very large text columns | Moderate | Good general purpose |
| ZSTD | Varied data types | High | Good general purpose, better than LZO |
| RUNLENGTH | Repeated values | Very high | Excellent for low-cardinality columns |
| TEXT255/TEXT32K | Variable-length strings | High | Good for text |
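
Encodings are set per column in the DDL, or recommended by Redshift after sampling existing data. A sketch with a hypothetical table:

CREATE TABLE page_views (
  view_id    BIGINT       ENCODE az64,     -- numeric data
  view_date  DATE         ENCODE delta,    -- incremental dates
  country    CHAR(2)      ENCODE bytedict, -- few distinct values
  user_agent VARCHAR(512) ENCODE zstd      -- varied text
);

-- Ask Redshift to recommend encodings based on a sample of the table's data
ANALYZE COMPRESSION page_views;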

Important Redshift Limits and Performance Factors

  1. Maximum of 500 concurrent connections per cluster
  2. Default query timeout is 1 hour (configurable)
  3. Maximum of 50 concurrent queries by default
  4. Maximum of 100 databases per cluster
  5. Maximum of 9,900 schemas per database
  6. Maximum of 200,000 tables per cluster (including temporary tables)
  7. Maximum row size is 4 MB
  8. Maximum column name length is 127 bytes
  9. Maximum 1,600 columns per table
  10. Maximum identifier length is 127 bytes
  11. Maximum SQL statement size is 16 MB

Query Performance Optimization

  1. Use COPY command for bulk data loading (8-10x faster than INSERT)
  2. Choose appropriate distribution keys to minimize data movement
  3. Use sort keys for columns frequently used in WHERE clauses
  4. Vacuum regularly to reclaim space and resort data
  5. Analyze tables to update statistics for the query planner
  6. Use appropriate compression encodings for columns
  7. Avoid SELECT * and retrieve only needed columns
  8. Use UNLOAD to export large result sets to S3
  9. Implement proper partitioning when using Redshift Spectrum
  10. Use materialized views for common, complex queries
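
A few of these steps expressed as SQL, reusing the hypothetical orders table from the sort key example (points 4, 5, 8 and 10 above):

-- Reclaim deleted space and re-sort rows, then refresh planner statistics
VACUUM FULL orders;
ANALYZE orders;

-- Precompute a common aggregation as a materialized view
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT sale_date, SUM(total) AS revenue
FROM orders
GROUP BY sale_date;

REFRESH MATERIALIZED VIEW daily_revenue;

-- Export a large result set to S3 instead of pulling it through the client
UNLOAD ('SELECT * FROM daily_revenue')
TO 's3://mybucket/exports/daily_revenue_'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;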

Data Loading Best Practices

  1. Use COPY command from S3, not INSERT statements
  2. Split large files into multiple files (1-128 MB each)
  3. Use gzip compression for load files
  4. Load data in parallel using multiple files
  5. Use a manifest file to ensure all files are loaded (see the manifest example after the COPY command below)
  6. Use STATUPDATE ON to update statistics after loading
  7. Use COMPUPDATE ON for automatic compression analysis
  8. Temporarily disable automatic compression for very large loads
  9. Use a single COPY transaction for related tables
  10. Example COPY command:
COPY customer
FROM 's3://mybucket/customer/data/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|' REGION 'us-west-2'
GZIP COMPUPDATE ON;
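
If the exact set of input files matters (point 5 above), the COPY can read a manifest instead of a key prefix. A sketch, where s3://mybucket/manifests/customer.manifest is a hypothetical JSON file listing each data file under an "entries" array:

COPY customer
FROM 's3://mybucket/manifests/customer.manifest'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MANIFEST                      -- treat the FROM target as a manifest, not a prefix
DELIMITER '|'
GZIP;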

Workload Management (WLM)

  1. Automatic WLM: Redshift manages query queues and memory allocation
  2. Manual WLM: Define up to 8 queues with custom settings
  3. Short Query Acceleration (SQA): Prioritizes short-running queries
  4. Concurrency scaling: Automatically adds transient clusters for read queries
  5. Query monitoring rules: Define metrics-based actions for long-running queries
  6. Query priority: Assign importance levels to different workloads
  7. User groups: Assign users to specific WLM queues
  8. Memory allocation: Control percentage of memory allocated to each queue
  9. Concurrency level: Set maximum concurrent queries per queue
  10. Timeout: Set maximum execution time per queue
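
Queue routing and queue state can be inspected directly from SQL. A small sketch; the 'etl' query group label is arbitrary and assumes a matching WLM queue has been configured:

-- Route this session's queries to the WLM queue associated with the 'etl' query group
SET query_group TO 'etl';

-- Inspect queries currently queued or executing in WLM
SELECT query, service_class, state, queue_time, exec_time
FROM stv_wlm_query_state
ORDER BY queue_time DESC;

RESET query_group;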

Redshift Spectrum

  1. Query data directly in S3 without loading into Redshift
  2. Supports various file formats: Parquet, ORC, JSON, CSV, Avro
  3. Uses external tables defined in AWS Glue Data Catalog
  4. Scales automatically to thousands of instances
  5. Supports complex data types and nested data
  6. Partition pruning improves performance dramatically (see the ADD PARTITION example below)
  7. Charged separately from Redshift cluster usage
  8. Example external table creation:
CREATE EXTERNAL TABLE spectrum.sales(
  salesid INTEGER,
  listid INTEGER,
  sellerid INTEGER,
  buyerid INTEGER,
  eventid INTEGER,
  dateid INTEGER,
  qtysold INTEGER,
  pricepaid DECIMAL(8,2),
  commission DECIMAL(8,2)
)
PARTITIONED BY (saledate DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/spectrum/sales/';
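
Partition pruning (point 6) only helps if partitions are registered in the catalog. For the table above, each partition is typically added as new data lands in S3 (the date shown is just an example):

ALTER TABLE spectrum.sales
ADD IF NOT EXISTS PARTITION (saledate = '2008-01-01')
LOCATION 's3://mybucket/spectrum/sales/saledate=2008-01-01/';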

Security Features

  1. VPC networking with security groups
  2. IAM integration for authentication and authorization
  3. Column-level access control
  4. Row-level security policies
  5. Dynamic data masking
  6. AWS KMS integration for encryption at rest
  7. SSL for encryption in transit
  8. CloudTrail integration for audit logging
  9. Multi-factor authentication support
  10. Integration with AWS Lake Formation for fine-grained access control
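
Column-level grants are plain SQL, and row-level security is defined with RLS policies. A sketch with hypothetical table, group, and role names (the RLS syntax shown is indicative; check the current documentation for the exact options):

-- Column-level access control: expose only non-sensitive columns to a group
GRANT SELECT (customer_id, region) ON customers TO GROUP analysts;

-- Row-level security: each sales rep sees only their own rows
CREATE RLS POLICY own_rows_only
WITH (sales_rep VARCHAR(64))
USING (sales_rep = current_user);

ATTACH RLS POLICY own_rows_only ON customers TO ROLE analyst_role;
ALTER TABLE customers ROW LEVEL SECURITY ON;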

Backup and Disaster Recovery

  1. Automated snapshots (1-35 day retention)
  2. Manual snapshots (retained until deleted)
  3. Cross-region snapshot copy for disaster recovery
  4. Point-in-time recovery (up to 5-minute increments)
  5. Snapshot sharing across AWS accounts
  6. Automated snapshot schedule (every 8 hours by default)
  7. Snapshot restore to new cluster
  8. Incremental snapshots to minimize storage costs
  9. Snapshot storage in S3 (separate from cluster storage)
  10. Continuous, incremental backups for RA3 clusters

Redshift Serverless

  1. No cluster management required
  2. Automatic scaling of compute resources
  3. Pay only for what you use (RPU-seconds)
  4. Automatic pause and resume
  5. Seamless transition from provisioned clusters
  6. Same SQL interface as provisioned Redshift
  7. Integrated with Redshift Spectrum
  8. Base capacity specified in Redshift Processing Units (RPUs)
  9. Maximum capacity limits to control costs
  10. Ideal for unpredictable or intermittent workloads

Redshift Integration with AWS Services

  1. S3: Data loading, unloading, and Spectrum
  2. AWS Glue: Data catalog for Spectrum
  3. AWS DMS: Data migration to Redshift
  4. Kinesis Data Firehose: Streaming data ingestion
  5. AWS Lambda: Automated ETL and maintenance
  6. QuickSight: Business intelligence and visualization
  7. Lake Formation: Fine-grained access control
  8. CloudWatch: Monitoring and alerting
  9. CloudTrail: Audit logging
  10. AWS Secrets Manager: Credential management

Redshift vs. Open Source Alternatives

| Feature | Redshift | PostgreSQL | Apache Hive | Presto |
| --- | --- | --- | --- | --- |
| Architecture | MPP, columnar | SMP, row-based | MPP on Hadoop | MPP query engine |
| Scale | Petabytes | Terabytes | Petabytes | Petabytes |
| Performance | Very high | Moderate | Low to moderate | High for interactive queries |
| Management | Fully managed | Self-managed | Self-managed | Self-managed |
| Cost model | Pay for capacity | Infrastructure cost | Infrastructure cost | Infrastructure cost |
| SQL compliance | PostgreSQL 8.0 compatible | Full PostgreSQL | HiveQL (limited) | ANSI SQL |
| Concurrency | Limited (50+) | High | Limited | Moderate |
| Use case | Data warehousing | OLTP, small OLAP | Batch analytics | Interactive queries |

Important CloudWatch Metrics for Monitoring

| Metric | Description | Threshold | Action |
| --- | --- | --- | --- |
| CPUUtilization | Percentage of CPU used | >80% sustained | Consider scaling or query optimization |
| PercentageDiskSpaceUsed | Storage utilization | >80% | Resize cluster or clean up data |
| DatabaseConnections | Active connections | >80% of max | Increase connection limit or optimize connection pooling |
| QueriesCompletedPerSecond | Query throughput | Baseline dependent | Monitor for unexpected changes |
| QueryDuration | Time to execute queries | Baseline dependent | Optimize slow queries |
| WLMQueueLength | Queries waiting in queue | >5 consistently | Adjust WLM or scale cluster |
| WLMQueueWaitTime | Time queries wait in queue | >5 seconds | Adjust WLM or scale cluster |
| ReadIOPS | Read operations per second | Baseline dependent | Monitor for spikes or drops |
| WriteIOPS | Write operations per second | Baseline dependent | Monitor for spikes or drops |
| ReadLatency | Time for disk read operations | >20 ms | Investigate storage issues |
| WriteLatency | Time for disk write operations | >20 ms | Investigate storage issues |
| ConcurrencyScalingActiveClusters | Number of concurrency scaling clusters | Cost dependent | Monitor for unexpected scaling |

Data Ingestion and Pipeline Replayability

  1. Use COPY command with manifest files to track loaded files
  2. Implement idempotent data loading with IDENTITY columns or natural keys
  3. Use staging tables and transactions for atomic loads (see the sketch after this list)
  4. Implement error handling with MAXERROR parameter in COPY
  5. Store raw data in S3 for reprocessing if needed
  6. Use Kinesis Data Firehose for streaming data ingestion
  7. Implement data validation before and after loading
  8. Use AWS Glue for ETL job orchestration
  9. Implement checkpointing in data pipelines for resumability
  10. Use AWS Step Functions for complex pipeline orchestration
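
Points 3 and 4 are commonly combined into an idempotent load pattern: stage the batch, then replace matching rows inside one transaction (table names and S3 paths are hypothetical; note that TRUNCATE commits implicitly in Redshift, so it runs before the transaction):

-- Empty the staging table before loading the new batch
TRUNCATE customer_staging;

COPY customer_staging
FROM 's3://mybucket/customer/2024-01-01/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER '|' GZIP
MAXERROR 10;                  -- tolerate a small number of bad records

BEGIN;
-- Remove rows that this batch replaces, then insert the new versions
DELETE FROM customer
USING customer_staging
WHERE customer.customer_id = customer_staging.customer_id;

INSERT INTO customer SELECT * FROM customer_staging;
COMMIT;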

Throughput and Latency Characteristics

  1. COPY command throughput: Up to several GB/s depending on cluster size
  2. Bulk loading is significantly faster than row-by-row inserts
  3. Query latency varies from milliseconds to hours depending on complexity
  4. Concurrency scaling adds read capacity within seconds
  5. Elastic resize completes within minutes
  6. Classic resize can take hours depending on data volume
  7. Vacuum operation speed depends on unsorted data percentage
  8. Redshift Spectrum queries have higher latency than local queries
  9. WLM queue wait time impacts overall query latency
  10. Result caching provides sub-second response for repeated queries

Implementing Throttling and Overcoming Rate Limits

  1. Use connection pooling to manage database connections
  2. Implement exponential backoff for API calls
  3. Use WLM to prioritize critical queries
  4. Implement client-side query queuing for high-concurrency applications
  5. Use short query acceleration for time-sensitive small queries
  6. Batch small inserts into larger COPY operations
  7. Use concurrency scaling for read-heavy workloads
  8. Implement retry logic for throttled operations
  9. Monitor and alert on queue wait times
  10. Use reserved capacity for predictable workloads

Advanced Features and Optimizations

  1. Materialized views for precomputed query results
  2. Automatic table optimization for sort and distribution keys
  3. Automatic vacuum delete for maintaining performance
  4. Automatic analyze for statistics maintenance
  5. Query monitoring rules for workload management
  6. Federated queries to access data in other databases
  7. Data sharing across Redshift clusters
  8. Machine learning integration with Amazon SageMaker
  9. Spatial data support for geospatial analytics
  10. HyperLogLog functions for cardinality estimation
  11. Time series functions for time-based analysis
  12. Window functions for advanced analytics
  13. AQUA (Advanced Query Accelerator) for RA3 nodes
  14. Cross-database queries within a cluster
  15. Semi-structured data support (SUPER data type)
  16. JSON and PartiQL support for flexible data models
  17. Stored procedures for complex logic
  18. User-defined functions (UDFs) for custom operations
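
A short sketch of the SUPER data type and PartiQL navigation (points 15 and 16), with hypothetical names:

-- Store semi-structured JSON alongside relational columns
CREATE TABLE events (
  event_id BIGINT,
  payload  SUPER
);

-- JSON_PARSE converts a JSON string into a SUPER value
INSERT INTO events
SELECT 1, JSON_PARSE('{"event_type": "purchase", "customer": {"id": 42}, "items": [{"sku": "A1"}]}');

-- PartiQL dot and bracket notation navigates the nested structure
SELECT event_id, payload.customer.id, payload.items[0].sku
FROM events
WHERE payload.event_type = 'purchase';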

Example Calculation: Cluster Sizing

  1. Storage Calculation:

    • Raw data size: 1 TB
    • Compression ratio: 3:1 (typical for Redshift)
    • Required storage: 1 TB ÷ 3 = 333 GB
    • Add 20% for sorting and intermediate results: 333 GB × 1.2 = 400 GB
    • Recommended cluster: ra3.4xlarge (managed storage)
  2. Performance Calculation:

    • Peak concurrent queries: 20
    • Target query execution time: 10 seconds
    • Required slices: 20 queries × 2 slices per query = 40 slices
    • Each ra3.4xlarge node has 4 slices
    • Required nodes: 40 slices ÷ 4 slices per node = 10 nodes
    • Recommended cluster: 10 × ra3.4xlarge nodes
  3. Cost Calculation (example):

    • 10 × ra3.4xlarge at $3.26/hour = $32.60/hour
    • Monthly cost: $32.60 × 24 × 30 = $23,472.00
    • Managed storage: 400 GB at $0.024/GB/month = $9.60/month
    • Total monthly cost: $23,481.60
  4. Concurrency Scaling Cost:

    • 1 free concurrency scaling hour per day = 30 hours/month
    • Additional usage: 2 hours/day × 30 days = 60 hours/month
    • Billable hours: 60 - 30 = 30 hours/month
    • Cost: 30 hours × $32.60/hour = $978.00/month
