AWS DynamoDB Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)
Core Concepts and Building Blocks
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
Key components:
- Tables: Collections of items (similar to tables in relational databases)
- Items: Collections of attributes (similar to rows)
- Attributes: Fundamental data elements (similar to columns)
- Primary Key: Uniquely identifies each item in a table
- Secondary Indexes: Additional access patterns beyond primary key
- Capacity Units: Measure of throughput provisioning
DynamoDB Fundamentals
Primary Key Types:
- Simple Primary Key: Partition key only
- Composite Primary Key: Partition key + Sort key
Data Types:
- Scalar: Number, String, Binary, Boolean, Null
- Document: List, Map
- Set: String Set, Number Set, Binary Set
Read Consistency:
- Eventually Consistent: May not reflect most recent write (default, cheaper)
- Strongly Consistent: Always reflects most recent write
- ACID Transactions: Supports atomic, consistent, isolated, durable transactions
Capacity Modes:
- Provisioned: Specify read/write capacity units in advance
- On-Demand: Pay-per-request, auto-scales instantly
Service Limits:
- Max item size: 400KB
- Max partition throughput: 3000 RCU and 1000 WCU
- Max table size: Unlimited
- Max number of tables per region: 2500 (default, can be increased)
- Max number of GSIs per table: 20
- Max number of LSIs per table: 5
Capacity Planning and Performance
Concept | Description | Formula/Example |
---|---|---|
Read Capacity Unit (RCU) | 1 strongly consistent read per second for items up to 4KB | 1 RCU = 1 strongly consistent read/sec for 4KB item |
Write Capacity Unit (WCU) | 1 write per second for items up to 1KB | 1 WCU = 1 write/sec for 1KB item |
Eventually Consistent Read | Consumes half the RCUs of strongly consistent reads | 0.5 RCU for 4KB item |
Transactional Read | Consumes 2x RCUs of strongly consistent reads | 2 RCUs for 4KB item |
Transactional Write | Consumes 2x WCUs of standard writes | 2 WCUs for 1KB item |
Capacity Calculation Examples:
- Reading 10 items of 8KB each per second (strongly consistent): 10 × (8KB/4KB) = 20 RCUs
- Writing 5 items of 2.5KB each per second: item size rounds up to the next 1KB, so each write costs ceil(2.5KB/1KB) = 3 WCUs; 5 × 3 = 15 WCUs
- Reading 15 items of 6KB each per second (eventually consistent): item size rounds up to the next 4KB, so each read costs ceil(6KB/4KB) × 0.5 = 1 RCU; 15 × 1 = 15 RCUs
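The rounding rules in these examples can be captured in a small helper. A minimal sketch (the function names are illustrative, not part of any AWS SDK); note that the item size rounds up to the 4KB/1KB boundary per item, before multiplying by the request rate:

```python
import math

def rcus(items_per_sec: int, item_kb: float, consistency: str = "strong") -> int:
    """RCUs needed: item size rounds up to the next 4KB before multiplying."""
    per_item = math.ceil(item_kb / 4)  # strongly consistent RCUs per item
    if consistency == "eventual":
        return math.ceil(items_per_sec * per_item * 0.5)
    if consistency == "transactional":
        return items_per_sec * per_item * 2
    return items_per_sec * per_item

def wcus(items_per_sec: int, item_kb: float, transactional: bool = False) -> int:
    """WCUs needed: item size rounds up to the next 1KB before multiplying."""
    per_item = math.ceil(item_kb / 1)
    return items_per_sec * per_item * (2 if transactional else 1)

print(rcus(10, 8))              # 20
print(wcus(5, 2.5))             # 15
print(rcus(15, 6, "eventual"))  # 15
```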
Adaptive Capacity: Automatically distributes workloads across partitions to handle hot keys and hot partitions.
Burst Capacity: Unused capacity is stored for up to 5 minutes to handle sudden spikes.
Secondary Indexes
Feature | Local Secondary Index (LSI) | Global Secondary Index (GSI) |
---|---|---|
Key Schema | Same partition key as base table, different sort key | Different partition key and/or sort key |
Creation | Only at table creation time | Can be added/removed anytime |
Size Limits | 10GB per item collection (all items sharing a partition key value, including index entries) | No size limit |
Consistency | Supports both eventual and strong consistency | Only eventual consistency |
Provisioning | Uses base table's capacity | Has its own provisioned capacity |
Projection Types | ALL, KEYS_ONLY, INCLUDE | ALL, KEYS_ONLY, INCLUDE |
Projection Types:
- ALL: All attributes from the base table
- KEYS_ONLY: Only the index key attributes and the base table's primary key
- INCLUDE: The KEYS_ONLY attributes plus the specified non-key attributes
Sparse Indexes: Indexes that only contain items with the specified attributes, useful for queries on a subset of data.
Data Management and Operations
DynamoDB Streams: Captures item-level modifications in a DynamoDB table and sends to Lambda, Kinesis, or other services.
Time To Live (TTL): Automatically deletes items after a specified timestamp.
Point-in-Time Recovery (PITR): Continuous backups for the last 35 days.
On-Demand Backup: Full backups for long-term retention.
Global Tables: Multi-region, multi-active replication for global applications.
DAX (DynamoDB Accelerator): In-memory cache for DynamoDB, reduces read latency from milliseconds to microseconds.
Encryption: All data is encrypted at rest by default using AWS KMS.
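TTL expects a Number attribute holding a Unix epoch timestamp in seconds. A sketch of building such an item in low-level DynamoDB JSON (the attribute name `expires_at` and key name `pk` are arbitrary choices — TTL works on whatever attribute you designate when enabling it):

```python
import time

def item_with_ttl(pk: str, days: int) -> dict:
    """Build an item whose 'expires_at' attribute a TTL policy can act on."""
    return {
        "pk": {"S": pk},
        # TTL attribute must be a Number of epoch *seconds*, not milliseconds
        "expires_at": {"N": str(int(time.time()) + days * 86_400)},
    }

item = item_with_ttl("user#123", days=30)
```

Items past their timestamp are deleted in the background (typically within days, not instantly), so queries should still filter out expired items if exactness matters.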
Data Modeling and Access Patterns
Single-Table Design: Store multiple entity types in one table to minimize queries.
Overloading Keys: Using the same attribute for different entity types.
Composite Sort Keys: Combining multiple attributes in the sort key for hierarchical data.
GSI Overloading: Using GSIs for different access patterns on the same table.
Sparse Indexes: Only items with the indexed attribute appear in the index.
Sort Key Patterns:
- Hierarchical data: department#engineering#project#dynamo
- Time-based data: 2023-05-15#12:30:45
- Version control: v1.0.2#2023-05-15
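Composite sort keys like the ones above are just delimited strings, which makes `begins_with` queries possible at any hierarchy level. A sketch (the `#` delimiter and attribute names are conventions, not requirements; the dict shows the parameters a boto3 `Table.query` call would accept):

```python
def make_sort_key(*parts: str, sep: str = "#") -> str:
    """Join hierarchy levels into one sort key, most significant level first."""
    return sep.join(parts)

# Query every project under a department with a begins_with condition.
prefix = make_sort_key("department", "engineering") + "#"
query_params = {
    "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
    "ExpressionAttributeValues": {":pk": "ORG#acme", ":prefix": prefix},
}
# table.query(**query_params)
```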
API Operations
Operation Category | Key Operations | Description |
---|---|---|
Control Plane | CreateTable, UpdateTable, DeleteTable | Manage table structure and settings |
Data Plane (Item) | PutItem, GetItem, UpdateItem, DeleteItem | Single-item operations |
Data Plane (Batch) | BatchGetItem, BatchWriteItem | Multiple-item operations (BatchGetItem up to 100 items, BatchWriteItem up to 25) |
Data Plane (Query) | Query | Find items with same partition key, filter by sort key |
Data Plane (Scan) | Scan | Examine every item in a table (expensive) |
Transactions | TransactWriteItems, TransactGetItems | ACID transactions across multiple items |
PartiQL: SQL-compatible query language for DynamoDB.
Filter Expressions: Applied server-side after items are read but before results are returned; filtered-out items still consume read capacity.
Projection Expressions: Specify which attributes to return.
Condition Expressions: Only perform writes if conditions are met.
Expression Attribute Names/Values: Placeholders for attribute names and values in expressions.
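A conditional update tying these expression pieces together. The parameter shapes match what a boto3 `Table.update_item` call accepts, though the table contents and attribute names here are invented for illustration:

```python
# '#s' is an expression attribute name placeholder, needed because
# 'status' is a DynamoDB reserved word; ':new' and ':expected' are value placeholders.
update_params = {
    "Key": {"pk": "order#42"},
    "UpdateExpression": "SET #s = :new",
    "ConditionExpression": "#s = :expected",  # write only if current status matches
    "ExpressionAttributeNames": {"#s": "status"},
    "ExpressionAttributeValues": {":new": "SHIPPED", ":expected": "PACKED"},
}
# table.update_item(**update_params)
# raises ConditionalCheckFailedException if the condition does not hold
```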
Performance Optimization
Partition Key Design: Choose high-cardinality attributes to distribute data evenly.
Avoid Hot Partitions: Distribute workload evenly across partition keys.
Use Batch Operations: BatchGetItem and BatchWriteItem for efficiency.
Parallel Scans: Split large tables into segments for faster scanning.
Sparse Indexes: Create indexes only on frequently queried attributes.
Attribute Projections: Only project needed attributes in secondary indexes.
Query Instead of Scan: Always prefer Query over Scan operations.
Page Size Optimization: Use Limit parameter and pagination for large result sets.
DAX Caching: Use DAX for read-heavy workloads to reduce latency.
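When a scan is unavoidable, a parallel scan splits the table into N segments, each scanned by its own worker. A sketch of the per-worker parameters for boto3 `Table.scan` (the segment count of 4 is an arbitrary choice; tune it to worker count and table size):

```python
TOTAL_SEGMENTS = 4  # one worker per segment

def segment_scan_params(segment: int) -> dict:
    """Parameters for one worker's Table.scan call."""
    return {"TotalSegments": TOTAL_SEGMENTS, "Segment": segment}

# Each worker pages with ExclusiveStartKey until LastEvaluatedKey is absent:
#   resp = table.scan(**params)
#   params["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
all_params = [segment_scan_params(i) for i in range(TOTAL_SEGMENTS)]
```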
Cost Optimization
Reserved Capacity: Purchase reserved capacity for predictable workloads (up to 72% savings).
Auto Scaling: Configure to adjust capacity based on utilization.
On-Demand Mode: For unpredictable workloads or development environments.
Compression: Compress large attribute values before storing.
TTL: Use TTL to automatically remove unnecessary data.
Monitor CloudWatch Metrics: Track consumed capacity to optimize provisioning.
Sparse GSIs: Minimize the number of items in GSIs to reduce costs.
Integration with Other AWS Services
Lambda: Triggers for DynamoDB Streams, processing data changes.
Kinesis: Stream DynamoDB data changes to Kinesis for real-time analytics.
S3: Export/import data between DynamoDB and S3.
Glue: ETL jobs for DynamoDB data.
Athena: Query exported DynamoDB data in S3 using SQL.
EMR: Process DynamoDB data using Hadoop ecosystem.
AppSync: GraphQL interface for DynamoDB.
Step Functions: Orchestrate workflows involving DynamoDB operations.
Data Ingestion and Processing
Bulk Loading: Use AWS Data Pipeline or custom solutions with BatchWriteItem.
Kinesis Data Streams: Buffer incoming records and write them to DynamoDB with a Lambda consumer (Kinesis Data Firehose has no native DynamoDB destination).
DynamoDB Streams with Lambda: Process and transform data as it's modified.
Write Sharding: Distribute writes across multiple partition keys.
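Write sharding appends a suffix to a hot partition key so writes spread across partitions. A minimal sketch, assuming a deterministic hash-based suffix so items can be located again on read (the shard count of 10 and the `#` suffix format are arbitrary choices):

```python
import hashlib

SHARDS = 10

def sharded_key(base_key: str, item_id: str) -> str:
    """Deterministically pick a shard from the item id, so reads can recompute it."""
    shard = int(hashlib.sha256(item_id.encode()).hexdigest(), 16) % SHARDS
    return f"{base_key}#{shard}"

# Reads that need *all* items for base_key must query every shard
# (base_key#0 .. base_key#9) and merge the results.
```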
Throttling Handling: Implement exponential backoff and jitter for retries.
Rate Limiting: Use client-side throttling to prevent exceeding provisioned capacity.
Replayability: Use DynamoDB Streams with Lambda to enable replay of data processing.
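The throttling-handling advice above can be sketched as a retry wrapper using full jitter (delay bounds are illustrative; in real code you would catch DynamoDB's `ProvisionedThroughputExceededException` rather than a bare `Exception`):

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Retry a throttled call with full jitter: sleep in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice: ProvisionedThroughputExceededException
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Demo: fails twice with a simulated throttle, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky)
```

Full jitter (random delay up to the exponential cap) spreads retries out and avoids the thundering-herd effect that fixed exponential delays can cause.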
DynamoDB Transactions
TransactWriteItems: Atomic writes across multiple items and tables (up to 100 actions per transaction).
TransactGetItems: Atomic reads across multiple items and tables (up to 100 items per transaction).
Idempotency: Use client tokens to make transactions idempotent.
Isolation Levels: Provides serializable isolation.
Capacity Consumption: Transactions consume 2x the normal capacity units.
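An idempotent transaction sketch: reusing the same `ClientRequestToken` makes retries of the same logical transaction safe. The parameter shapes match the low-level boto3 `transact_write_items` call; the table names, keys, and attributes are invented:

```python
import uuid

token = str(uuid.uuid4())  # reuse this exact token when retrying this transaction

transact_params = {
    "ClientRequestToken": token,  # makes retried submissions idempotent
    "TransactItems": [
        {"Put": {
            "TableName": "Orders",
            "Item": {"pk": {"S": "order#42"}, "status": {"S": "PLACED"}},
        }},
        {"Update": {
            "TableName": "Inventory",
            "Key": {"pk": {"S": "sku#7"}},
            "UpdateExpression": "SET stock = stock - :one",
            "ConditionExpression": "stock >= :one",  # fails the whole txn if out of stock
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ],
}
# client.transact_write_items(**transact_params)
```

If either action's condition fails, neither write is applied — that all-or-nothing behavior is the point of the transaction.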
DynamoDB Streams and Change Data Capture
Stream Records: Depending on the stream view type, contain the key attributes and/or the before and after images of modified items.
Stream View Types:
- KEYS_ONLY: Only key attributes
- NEW_IMAGE: The entire item after modification
- OLD_IMAGE: The entire item before modification
- NEW_AND_OLD_IMAGES: Both before and after images
Kinesis Adapter: Process DynamoDB Streams using Kinesis Client Library.
Lambda Triggers: Automatically invoke Lambda functions on stream events.
Stream Retention: Data is stored for 24 hours.
Change Data Capture (CDC): Use Streams for CDC patterns to other systems.
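A Lambda handler consuming a stream configured with NEW_AND_OLD_IMAGES. The event shape below follows what DynamoDB Streams delivers to Lambda; the business logic is reduced to a counting stub:

```python
def handler(event: dict, context=None) -> dict:
    """Route each stream record by event type; images arrive in DynamoDB JSON."""
    counts = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    for record in event.get("Records", []):
        name = record["eventName"]   # INSERT | MODIFY | REMOVE
        ddb = record["dynamodb"]
        old = ddb.get("OldImage")    # absent for INSERT
        new = ddb.get("NewImage")    # absent for REMOVE
        counts[name] += 1
        # a real handler would transform old/new and forward them downstream
    return counts

sample = {"Records": [
    {"eventName": "INSERT", "dynamodb": {"NewImage": {"pk": {"S": "a"}}}},
    {"eventName": "REMOVE", "dynamodb": {"OldImage": {"pk": {"S": "a"}}}},
]}
```

Because Lambda retries failed batches, handlers like this should be idempotent — processing the same record twice must not corrupt downstream state.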
DynamoDB Features Comparison
Feature | Standard Tables | Global Tables |
---|---|---|
Replication | Single region | Multi-region, multi-active |
Latency | Region-specific | Local reads in each region |
Failover | Manual | Automatic |
Consistency | Both eventual and strong | Eventually consistent across regions |
Use Case | Regional applications | Global applications |
Feature | Provisioned Capacity | On-Demand Capacity |
---|---|---|
Cost Model | Pay for provisioned capacity | Pay per request |
Scaling | Auto-scaling or manual | Automatic |
Predictability | Predictable costs for steady traffic | Costs scale with actual usage |
Burst Handling | Limited burst capacity | Unlimited burst capacity |
Use Case | Predictable workloads | Variable/unpredictable workloads |
DynamoDB Service Features Summary
Feature | Description | Limits/Notes |
---|---|---|
Tables | Collection of items | Unlimited size, 2500 tables per region |
Items | Collection of attributes | 400KB max size per item |
Primary Key | Unique identifier | Simple or composite |
Secondary Indexes | Alternative access patterns | 20 GSIs, 5 LSIs per table |
Capacity Units | Throughput measurement | RCU (4KB reads), WCU (1KB writes) |
Consistency Models | Data retrieval consistency | Eventually or strongly consistent |
Transactions | ACID operations | 2x capacity consumption |
Streams | Change data capture | 24-hour retention |
TTL | Automatic item expiration | Based on timestamp attribute |
Backups | Data protection | On-demand and continuous (PITR) |
Global Tables | Multi-region replication | Active-active configuration |
DAX | In-memory cache | Microsecond latency |
Encryption | Data protection | Server-side encryption by default |
Auto Scaling | Automatic capacity adjustment | Target utilization percentage |
PartiQL | SQL-compatible queries | SQL-like access to NoSQL data |
Important CloudWatch Metrics for Monitoring
ConsumedReadCapacityUnits: The number of read capacity units consumed.
ConsumedWriteCapacityUnits: The number of write capacity units consumed.
ProvisionedReadCapacityUnits: The number of provisioned read capacity units.
ProvisionedWriteCapacityUnits: The number of provisioned write capacity units.
ReadThrottleEvents: Requests to DynamoDB that exceed the provisioned read capacity units.
WriteThrottleEvents: Requests to DynamoDB that exceed the provisioned write capacity units.
SuccessfulRequestLatency: The latency of successful requests to DynamoDB.
ThrottledRequests: The number of throttled requests.
SystemErrors: The number of requests that generated an error due to system issues.
UserErrors: The number of requests that generated an error due to user issues.
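A sketch of fetching one of these metrics. The parameter shapes match boto3's CloudWatch `get_metric_statistics` call; the table name is invented:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
metric_params = {
    "Namespace": "AWS/DynamoDB",
    "MetricName": "ConsumedReadCapacityUnits",
    "Dimensions": [{"Name": "TableName", "Value": "Orders"}],
    "StartTime": now - timedelta(hours=1),
    "EndTime": now,
    "Period": 60,           # one datapoint per minute
    "Statistics": ["Sum"],  # sum per period; divide by Period for average RCU/sec
}
# cloudwatch.get_metric_statistics(**metric_params)
```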
DynamoDB Mind Map
DynamoDB
├── Core Components
│ ├── Tables
│ ├── Items
│ ├── Attributes
│ ├── Primary Keys
│ │ ├── Partition Key
│ │ └── Sort Key
│ └── Secondary Indexes
│ ├── Local Secondary Indexes (LSI)
│ └── Global Secondary Indexes (GSI)
├── Capacity Management
│ ├── Provisioned Mode
│ │ ├── Read Capacity Units (RCU)
│ │ ├── Write Capacity Units (WCU)
│ │ └── Auto Scaling
│ └── On-Demand Mode
├── Data Access
│ ├── Single-Item Operations
│ │ ├── GetItem
│ │ ├── PutItem
│ │ ├── UpdateItem
│ │ └── DeleteItem
│ ├── Multi-Item Operations
│ │ ├── Query
│ │ ├── Scan
│ │ ├── BatchGetItem
│ │ └── BatchWriteItem
│ └── Transactions
│ ├── TransactWriteItems
│ └── TransactGetItems
├── Advanced Features
│ ├── DynamoDB Streams
│ ├── Global Tables
│ ├── Time To Live (TTL)
│ ├── Point-in-Time Recovery
│ └── DAX (DynamoDB Accelerator)
├── Data Modeling
│ ├── Single-Table Design
│ ├── Access Patterns
│ ├── Composite Keys
│ └── Sparse Indexes
└── Integration
├── Lambda
├── Kinesis
├── S3
├── Glue
└── AppSync
Throttling and Rate Limits
Throttling: Occurs when requests exceed provisioned capacity.
Exponential Backoff: Retry failed operations with increasing wait times.
Jitter: Add randomness to retry intervals to prevent thundering herd problems.
Request Rate Limiting: Client-side throttling to stay within limits.
Burst Capacity: Temporarily exceeding provisioned capacity (up to 5 minutes of unused capacity).
Adaptive Capacity: Automatically redistributes capacity to handle hot partitions.
Error Handling: Implement proper handling for ProvisionedThroughputExceededException.
Monitoring: Set up CloudWatch alarms for throttling events.
Auto Scaling: Configure to adjust capacity based on consumption patterns.
On-Demand Mode: Switch to on-demand for unpredictable workloads to avoid throttling.
Throughput and Latency Characteristics
Read Latency: Single-digit milliseconds for standard operations.
Write Latency: Single-digit milliseconds for standard operations.
DAX Read Latency: Microseconds for cached reads.
Global Tables Latency: Local reads and writes in each region.
Scan Operation Latency: Proportional to table size and item count.
Query Operation Latency: Proportional to result set size.
Batch Operation Throughput: Up to 25 items per BatchWriteItem and up to 100 items per BatchGetItem.
Partition Throughput Limits: 3000 RCU and 1000 WCU per partition.
Transactional Operations: Higher latency than standard operations due to the underlying two-phase protocol.
Strongly Consistent Reads: Higher latency than eventually consistent reads.
Replayability of Data Ingestion Pipelines
DynamoDB Streams: Capture changes for 24 hours, enabling replay of recent changes.
Lambda Event Source Mapping: Tracks position in stream, can restart from specific sequence number.
Checkpointing: Store processing position in a separate DynamoDB table.
Dead Letter Queues: Capture failed processing attempts for later replay.
S3 Export/Import: Export data to S3 for long-term storage and replay capability.
Kinesis Integration: Use Kinesis as an intermediate buffer for enhanced replay capabilities.
Idempotent Processing: Design processors to handle duplicate events safely.
Event Sourcing Pattern: Store all events to enable complete system state reconstruction.
Change Data Capture (CDC): Use DynamoDB Streams as a CDC source for other systems.
ETL Orchestration: Use Step Functions to coordinate and retry data processing workflows.