AWS DynamoDB Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
Key components:
- Tables: Collections of items (similar to tables in relational databases)
- Items: Collections of attributes (similar to rows)
- Attributes: Fundamental data elements (similar to columns)
- Primary Key: Uniquely identifies each item in a table
- Secondary Indexes: Additional access patterns beyond primary key
- Capacity Units: Measure of throughput provisioning
DynamoDB Fundamentals
Primary Key Types:
- Simple Primary Key: Partition key only
- Composite Primary Key: Partition key + Sort key
Data Types:
- Scalar: Number, String, Binary, Boolean, Null
- Document: List, Map
- Set: String Set, Number Set, Binary Set
Read Consistency:
- Eventually Consistent: May not reflect most recent write (default, cheaper)
- Strongly Consistent: Always reflects most recent write
- ACID Transactions: Supports atomic, consistent, isolated, durable transactions
Capacity Modes:
- Provisioned: Specify read/write capacity units in advance
- On-Demand: Pay-per-request, auto-scales instantly
Service Limits:
- Max item size: 400KB
- Max partition throughput: 3000 RCU and 1000 WCU
- Max table size: Unlimited
- Max number of tables per region: 2500 (default, can be increased)
- Max number of GSIs per table: 20
- Max number of LSIs per table: 5
Capacity Planning and Performance
Concept | Description | Formula/Example |
---|---|---|
Read Capacity Unit (RCU) | 1 strongly consistent read per second for items up to 4KB | 1 RCU = 1 strongly consistent read/sec for 4KB item |
Write Capacity Unit (WCU) | 1 write per second for items up to 1KB | 1 WCU = 1 write/sec for 1KB item |
Eventually Consistent Read | Consumes half the RCUs of strongly consistent reads | 0.5 RCU for 4KB item |
Transactional Read | Consumes 2x RCUs of strongly consistent reads | 2 RCUs for 4KB item |
Transactional Write | Consumes 2x WCUs of standard writes | 2 WCUs for 1KB item |
Capacity Calculation Examples:
- Reading 10 items of 8KB each per second (strongly consistent): 10 × (8KB/4KB) = 20 RCUs
- Writing 5 items of 2.5KB each per second: item size rounds up to the next 1KB, so each write costs ceil(2.5KB/1KB) = 3 WCUs; 5 × 3 = 15 WCUs
- Reading 15 items of 6KB each per second (eventually consistent): item size rounds up to the next 4KB, so each read costs ceil(6KB/4KB) × 0.5 = 1 RCU; 15 × 1 = 15 RCUs
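The rounding rules in these examples can be captured in a small helper. A minimal sketch (the function names are illustrative, not part of any AWS SDK); note that the item size rounds up to the 4KB/1KB boundary per item, before multiplying by the request rate:

```python
import math

def rcus(items_per_sec: int, item_kb: float, consistency: str = "strong") -> int:
    """RCUs needed: item size rounds up to the next 4KB before multiplying."""
    per_item = math.ceil(item_kb / 4)  # strongly consistent RCUs per item
    if consistency == "eventual":
        return math.ceil(items_per_sec * per_item * 0.5)
    if consistency == "transactional":
        return items_per_sec * per_item * 2
    return items_per_sec * per_item

def wcus(items_per_sec: int, item_kb: float, transactional: bool = False) -> int:
    """WCUs needed: item size rounds up to the next 1KB before multiplying."""
    per_item = math.ceil(item_kb / 1)
    return items_per_sec * per_item * (2 if transactional else 1)

print(rcus(10, 8))              # 20
print(wcus(5, 2.5))             # 15
print(rcus(15, 6, "eventual"))  # 15
```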
Adaptive Capacity: Automatically distributes workloads across partitions to handle hot keys and hot partitions.
Burst Capacity: Unused capacity is stored for up to 5 minutes to handle sudden spikes.
Secondary Indexes
Feature | Local Secondary Index (LSI) | Global Secondary Index (GSI) |
---|---|---|
Key Schema | Same partition key as base table, different sort key | Different partition key and/or sort key |
Creation | Only at table creation time | Can be added/removed anytime |
Size Limits | 10GB per item collection (all items sharing a partition key value, including index entries) | No size limit |
Consistency | Supports both eventual and strong consistency | Only eventual consistency |
Provisioning | Uses base table's capacity | Has its own provisioned capacity |
Projection Types | ALL, KEYS_ONLY, INCLUDE | ALL, KEYS_ONLY, INCLUDE |
Projection Types:
- ALL: All attributes from the base table
- KEYS_ONLY: Only the index key attributes and the base table's primary key
- INCLUDE: The KEYS_ONLY attributes plus the specified non-key attributes
Sparse Indexes: Indexes that only contain items with the specified attributes, useful for queries on a subset of data.
Data Management and Operations
DynamoDB Streams: Captures item-level modifications in a DynamoDB table and sends to Lambda, Kinesis, or other services.
Time To Live (TTL): Automatically deletes items after a specified timestamp.
Point-in-Time Recovery (PITR): Continuous backups for the last 35 days.
On-Demand Backup: Full backups for long-term retention.
Global Tables: Multi-region, multi-active replication for global applications.
DAX (DynamoDB Accelerator): In-memory cache for DynamoDB, reduces read latency from milliseconds to microseconds.
Encryption: All data is encrypted at rest by default using AWS KMS.
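TTL expects a Number attribute holding a Unix epoch timestamp in seconds. A sketch of building such an item in low-level DynamoDB JSON (the attribute name `expires_at` and key name `pk` are arbitrary choices — TTL works on whatever attribute you designate when enabling it):

```python
import time

def item_with_ttl(pk: str, days: int) -> dict:
    """Build an item whose 'expires_at' attribute a TTL policy can act on."""
    return {
        "pk": {"S": pk},
        # TTL attribute must be a Number of epoch *seconds*, not milliseconds
        "expires_at": {"N": str(int(time.time()) + days * 86_400)},
    }

item = item_with_ttl("user#123", days=30)
```

Items past their timestamp are deleted in the background (typically within days, not instantly), so queries should still filter out expired items if exactness matters.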
Data Modeling and Access Patterns
Single-Table Design: Store multiple entity types in one table to minimize queries.
Overloading Keys: Using the same attribute for different entity types.
Composite Sort Keys: Combining multiple attributes in the sort key for hierarchical data.
GSI Overloading: Using GSIs for different access patterns on the same table.
Sparse Indexes: Only items with the indexed attribute appear in the index.
Sort Key Patterns:
- Hierarchical data: department#engineering#project#dynamo
- Time-based data: 2023-05-15#12:30:45
- Version control: v1.0.2#2023-05-15
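Composite sort keys like the ones above are just delimited strings, which makes `begins_with` queries possible at any hierarchy level. A sketch (the `#` delimiter and attribute names are conventions, not requirements; the dict shows the parameters a boto3 `Table.query` call would accept):

```python
def make_sort_key(*parts: str, sep: str = "#") -> str:
    """Join hierarchy levels into one sort key, most significant level first."""
    return sep.join(parts)

# Query every project under a department with a begins_with condition.
prefix = make_sort_key("department", "engineering") + "#"
query_params = {
    "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
    "ExpressionAttributeValues": {":pk": "ORG#acme", ":prefix": prefix},
}
# table.query(**query_params)
```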
API Operations
Operation Category | Key Operations | Description |
---|---|---|
Control Plane | CreateTable, UpdateTable, DeleteTable | Manage table structure and settings |
Data Plane (Item) | PutItem, GetItem, UpdateItem, DeleteItem | Single-item operations |
Data Plane (Batch) | BatchGetItem, BatchWriteItem | Multiple-item operations (BatchGetItem up to 100 items, BatchWriteItem up to 25) |
Data Plane (Query) | Query | Find items with same partition key, filter by sort key |
Data Plane (Scan) | Scan | Examine every item in a table (expensive) |
Transactions | TransactWriteItems, TransactGetItems | ACID transactions across multiple items |
PartiQL: SQL-compatible query language for DynamoDB.
Filter Expressions: Applied server-side after items are read but before results are returned; filtered-out items still consume read capacity.
Projection Expressions: Specify which attributes to return.
Condition Expressions: Only perform writes if conditions are met.
Expression Attribute Names/Values: Placeholders for attribute names and values in expressions.
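A conditional update tying these expression pieces together. The parameter shapes match what a boto3 `Table.update_item` call accepts, though the table contents and attribute names here are invented for illustration:

```python
# '#s' is an expression attribute name placeholder, needed because
# 'status' is a DynamoDB reserved word; ':new' and ':expected' are value placeholders.
update_params = {
    "Key": {"pk": "order#42"},
    "UpdateExpression": "SET #s = :new",
    "ConditionExpression": "#s = :expected",  # write only if current status matches
    "ExpressionAttributeNames": {"#s": "status"},
    "ExpressionAttributeValues": {":new": "SHIPPED", ":expected": "PACKED"},
}
# table.update_item(**update_params)
# raises ConditionalCheckFailedException if the condition does not hold
```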
Performance Optimization
Partition Key Design: Choose high-cardinality attributes to distribute data evenly.
Avoid Hot Partitions: Distribute workload evenly across partition keys.
Use Batch Operations: BatchGetItem and BatchWriteItem for efficiency.
Parallel Scans: Split large tables into segments for faster scanning.
Sparse Indexes: Create indexes only on frequently queried attributes.
Attribute Projections: Only project needed attributes in secondary indexes.
Query Instead of Scan: Always prefer Query over Scan operations.
Page Size Optimization: Use Limit parameter and pagination for large result sets.
DAX Caching: Use DAX for read-heavy workloads to reduce latency.
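When a scan is unavoidable, a parallel scan splits the table into N segments, each scanned by its own worker. A sketch of the per-worker parameters for boto3 `Table.scan` (the segment count of 4 is an arbitrary choice; tune it to worker count and table size):

```python
TOTAL_SEGMENTS = 4  # one worker per segment

def segment_scan_params(segment: int) -> dict:
    """Parameters for one worker's Table.scan call."""
    return {"TotalSegments": TOTAL_SEGMENTS, "Segment": segment}

# Each worker pages with ExclusiveStartKey until LastEvaluatedKey is absent:
#   resp = table.scan(**params)
#   params["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
all_params = [segment_scan_params(i) for i in range(TOTAL_SEGMENTS)]
```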
Cost Optimization
Reserved Capacity: Purchase reserved capacity for predictable workloads (up to 72% savings).
Auto Scaling: Configure to adjust capacity based on utilization.
On-Demand Mode: For unpredictable workloads or development environments.
Compression: Compress large attribute values before storing.
TTL: Use TTL to automatically remove unnecessary data.
Monitor CloudWatch Metrics: Track consumed capacity to optimize provisioning.
Sparse GSIs: Minimize the number of items in GSIs to reduce costs.
Integration with Other AWS Services
Lambda: Triggers for DynamoDB Streams, processing data changes.
Kinesis: Stream DynamoDB data changes to Kinesis for real-time analytics.
S3: Export/import data between DynamoDB and S3.
Glue: ETL jobs for DynamoDB data.
Athena: Query exported DynamoDB data in S3 using SQL.
EMR: Process DynamoDB data using Hadoop ecosystem.
AppSync: GraphQL interface for DynamoDB.
Step Functions: Orchestrate workflows involving DynamoDB operations.
Data Ingestion and Processing
Bulk Loading: Use AWS Data Pipeline or custom solutions with BatchWriteItem.
Kinesis Data Streams: Buffer incoming records and write them to DynamoDB with a Lambda consumer (Kinesis Data Firehose has no native DynamoDB destination).
DynamoDB Streams with Lambda: Process and transform data as it's modified.
Write Sharding: Distribute writes across multiple partition keys.
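Write sharding appends a suffix to a hot partition key so writes spread across partitions. A minimal sketch, assuming a deterministic hash-based suffix so items can be located again on read (the shard count of 10 and the `#` suffix format are arbitrary choices):

```python
import hashlib

SHARDS = 10

def sharded_key(base_key: str, item_id: str) -> str:
    """Deterministically pick a shard from the item id, so reads can recompute it."""
    shard = int(hashlib.sha256(item_id.encode()).hexdigest(), 16) % SHARDS
    return f"{base_key}#{shard}"

# Reads that need *all* items for base_key must query every shard
# (base_key#0 .. base_key#9) and merge the results.
```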
Throttling Handling: Implement exponential backoff and jitter for retries.
Rate Limiting: Use client-side throttling to prevent exceeding provisioned capacity.
Replayability: Use DynamoDB Streams with Lambda to enable replay of data processing.
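The throttling-handling advice above can be sketched as a retry wrapper using full jitter (delay bounds are illustrative; in real code you would catch DynamoDB's `ProvisionedThroughputExceededException` rather than a bare `Exception`):

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Retry a throttled call with full jitter: sleep in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice: ProvisionedThroughputExceededException
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Demo: fails twice with a simulated throttle, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky)
```

Full jitter (random delay up to the exponential cap) spreads retries out and avoids the thundering-herd effect that fixed exponential delays can cause.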
DynamoDB Transactions
TransactWriteItems: Atomic writes across multiple items and tables (up to 100 actions per transaction).
TransactGetItems: Atomic reads across multiple items and tables (up to 100 items per transaction).
Idempotency: Use client tokens to make transactions idempotent.
Isolation Levels: Provides serializable isolation.
Capacity Consumption: Transactions consume 2x the normal capacity units.
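An idempotent transaction sketch: reusing the same `ClientRequestToken` makes retries of the same logical transaction safe. The parameter shapes match the low-level boto3 `transact_write_items` call; the table names, keys, and attributes are invented:

```python
import uuid

token = str(uuid.uuid4())  # reuse this exact token when retrying this transaction

transact_params = {
    "ClientRequestToken": token,  # makes retried submissions idempotent
    "TransactItems": [
        {"Put": {
            "TableName": "Orders",
            "Item": {"pk": {"S": "order#42"}, "status": {"S": "PLACED"}},
        }},
        {"Update": {
            "TableName": "Inventory",
            "Key": {"pk": {"S": "sku#7"}},
            "UpdateExpression": "SET stock = stock - :one",
            "ConditionExpression": "stock >= :one",  # fails the whole txn if out of stock
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ],
}
# client.transact_write_items(**transact_params)
```

If either action's condition fails, neither write is applied — that all-or-nothing behavior is the point of the transaction.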
DynamoDB Streams and Change Data Capture
Stream Records: Depending on the stream view type, contain the key attributes and/or the before and after images of modified items.
Stream View Types:
- KEYS_ONLY: Only key attributes
- NEW_IMAGE: The entire item after modification
- OLD_IMAGE: The entire item before modification
- NEW_AND_OLD_IMAGES: Both before and after images
Kinesis Adapter: Process DynamoDB Streams using Kinesis Client Library.
Lambda Triggers: Automatically invoke Lambda functions on stream events.
Stream Retention: Data is stored for 24 hours.
Change Data Capture (CDC): Use Streams for CDC patterns to other systems.
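A Lambda handler consuming a stream configured with NEW_AND_OLD_IMAGES. The event shape below follows what DynamoDB Streams delivers to Lambda; the business logic is reduced to a counting stub:

```python
def handler(event: dict, context=None) -> dict:
    """Route each stream record by event type; images arrive in DynamoDB JSON."""
    counts = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    for record in event.get("Records", []):
        name = record["eventName"]   # INSERT | MODIFY | REMOVE
        ddb = record["dynamodb"]
        old = ddb.get("OldImage")    # absent for INSERT
        new = ddb.get("NewImage")    # absent for REMOVE
        counts[name] += 1
        # a real handler would transform old/new and forward them downstream
    return counts

sample = {"Records": [
    {"eventName": "INSERT", "dynamodb": {"NewImage": {"pk": {"S": "a"}}}},
    {"eventName": "REMOVE", "dynamodb": {"OldImage": {"pk": {"S": "a"}}}},
]}
```

Because Lambda retries failed batches, handlers like this should be idempotent — processing the same record twice must not corrupt downstream state.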
DynamoDB Features Comparison
Feature | Standard Tables | Global Tables |
---|---|---|
Replication | Single region | Multi-region, multi-active |
Latency | Region-specific | Local reads in each region |
Failover | Manual | Automatic |
Consistency | Both eventual and strong | Eventually consistent across regions |
Use Case | Regional applications | Global applications |
Feature | Provisioned Capacity | On-Demand Capacity |
---|---|---|
Cost Model | Pay for provisioned capacity | Pay per request |
Scaling | Auto-scaling or manual | Automatic |
Predictability | Predictable costs for steady traffic | Costs scale with actual usage |
Burst Handling | Limited burst capacity | Unlimited burst capacity |
Use Case | Predictable workloads | Variable/unpredictable workloads |
DynamoDB Service Features Summary
Feature | Description | Limits/Notes |
---|---|---|
Tables | Collection of items | Unlimited size, 2500 tables per region |
Items | Collection of attributes | 400KB max size per item |
Primary Key | Unique identifier | Simple or composite |
Secondary Indexes | Alternative access patterns | 20 GSIs, 5 LSIs per table |
Capacity Units | Throughput measurement | RCU (4KB reads), WCU (1KB writes) |
Consistency Models | Data retrieval consistency | Eventually or strongly consistent |
Transactions | ACID operations | 2x capacity consumption |
Streams | Change data capture | 24-hour retention |
TTL | Automatic item expiration | Based on timestamp attribute |
Backups | Data protection | On-demand and continuous (PITR) |
Global Tables | Multi-region replication | Active-active configuration |
DAX | In-memory cache | Microsecond latency |
Encryption | Data protection | Server-side encryption by default |
Auto Scaling | Automatic capacity adjustment | Target utilization percentage |
PartiQL | SQL-compatible queries | SQL-like access to NoSQL data |
Important CloudWatch Metrics for Monitoring
ConsumedReadCapacityUnits: The number of read capacity units consumed.
ConsumedWriteCapacityUnits: The number of write capacity units consumed.
ProvisionedReadCapacityUnits: The number of provisioned read capacity units.
ProvisionedWriteCapacityUnits: The number of provisioned write capacity units.
ReadThrottleEvents: Requests to DynamoDB that exceed the provisioned read capacity units.
WriteThrottleEvents: Requests to DynamoDB that exceed the provisioned write capacity units.
SuccessfulRequestLatency: The latency of successful requests to DynamoDB.
ThrottledRequests: The number of throttled requests.
SystemErrors: The number of requests that generated an error due to system issues.
UserErrors: The number of requests that generated an error due to user issues.
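A sketch of fetching one of these metrics. The parameter shapes match boto3's CloudWatch `get_metric_statistics` call; the table name is invented:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
metric_params = {
    "Namespace": "AWS/DynamoDB",
    "MetricName": "ConsumedReadCapacityUnits",
    "Dimensions": [{"Name": "TableName", "Value": "Orders"}],
    "StartTime": now - timedelta(hours=1),
    "EndTime": now,
    "Period": 60,           # one datapoint per minute
    "Statistics": ["Sum"],  # sum per period; divide by Period for average RCU/sec
}
# cloudwatch.get_metric_statistics(**metric_params)
```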
DynamoDB Mind Map
DynamoDB
├── Core Components
│ ├── Tables
│ ├── Items
│ ├── Attributes
│ ├── Primary Keys
│ │ ├── Partition Key
│ │ └── Sort Key
│ └── Secondary Indexes
│ ├── Local Secondary Indexes (LSI)
│ └── Global Secondary Indexes (GSI)
├── Capacity Management
│ ├── Provisioned Mode
│ │ ├── Read Capacity Units (RCU)
│ │ ├── Write Capacity Units (WCU)
│ │ └── Auto Scaling
│ └── On-Demand Mode
├── Data Access
│ ├── Single-Item Operations
│ │ ├── GetItem
│ │ ├── PutItem
│ │ ├── UpdateItem
│ │ └── DeleteItem
│ ├── Multi-Item Operations
│ │ ├── Query
│ │ ├── Scan
│ │ ├── BatchGetItem
│ │ └── BatchWriteItem
│ └── Transactions
│ ├── TransactWriteItems
│ └── TransactGetItems
├── Advanced Features
│ ├── DynamoDB Streams
│ ├── Global Tables
│ ├── Time To Live (TTL)
│ ├── Point-in-Time Recovery
│ └── DAX (DynamoDB Accelerator)
├── Data Modeling
│ ├── Single-Table Design
│ ├── Access Patterns
│ ├── Composite Keys
│ └── Sparse Indexes
└── Integration
├── Lambda
├── Kinesis
├── S3
├── Glue
└── AppSync
Throttling and Rate Limits
Throttling: Occurs when requests exceed provisioned capacity.
Exponential Backoff: Retry failed operations with increasing wait times.
Jitter: Add randomness to retry intervals to prevent thundering herd problems.
Request Rate Limiting: Client-side throttling to stay within limits.
Burst Capacity: Temporarily exceeding provisioned capacity (up to 5 minutes of unused capacity).
Adaptive Capacity: Automatically redistributes capacity to handle hot partitions.
Error Handling: Implement proper handling for ProvisionedThroughputExceededException.
Monitoring: Set up CloudWatch alarms for throttling events.
Auto Scaling: Configure to adjust capacity based on consumption patterns.
On-Demand Mode: Switch to on-demand for unpredictable workloads to avoid throttling.
Throughput and Latency Characteristics
Read Latency: Single-digit milliseconds for standard operations.
Write Latency: Single-digit milliseconds for standard operations.
DAX Read Latency: Microseconds for cached reads.
Global Tables Latency: Local reads and writes in each region.
Scan Operation Latency: Proportional to table size and item count.
Query Operation Latency: Proportional to result set size.
Batch Operation Throughput: Up to 25 items per BatchWriteItem and up to 100 items per BatchGetItem.
Partition Throughput Limits: 3000 RCU and 1000 WCU per partition.
Transactional Operations: Higher latency than standard operations due to the underlying two-phase protocol.
Strongly Consistent Reads: Higher latency than eventually consistent reads.
Replayability of Data Ingestion Pipelines
DynamoDB Streams: Capture changes for 24 hours, enabling replay of recent changes.
Lambda Event Source Mapping: Tracks position in stream, can restart from specific sequence number.
Checkpointing: Store processing position in a separate DynamoDB table.
Dead Letter Queues: Capture failed processing attempts for later replay.
S3 Export/Import: Export data to S3 for long-term storage and replay capability.
Kinesis Integration: Use Kinesis as an intermediate buffer for enhanced replay capabilities.
Idempotent Processing: Design processors to handle duplicate events safely.
Event Sourcing Pattern: Store all events to enable complete system state reconstruction.
Change Data Capture (CDC): Use DynamoDB Streams as a CDC source for other systems.
ETL Orchestration: Use Step Functions to coordinate and retry data processing workflows.