Data Tech Bridge
Amazon DynamoDB Cheat Sheet

AWS DynamoDB Cheat Sheet for AWS Certified Data Engineer - Associate (DEA-C01)

Core Concepts and Building Blocks

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

Key components:

  • Tables: Collections of items (a table is analogous to a table in a relational database)
  • Items: Collections of attributes (analogous to rows)
  • Attributes: Fundamental data elements (analogous to columns)
  • Primary Key: Uniquely identifies each item in a table
  • Secondary Indexes: Additional access patterns beyond primary key
  • Capacity Units: Measure of throughput provisioning

DynamoDB Fundamentals

  1. Primary Key Types:

    • Simple Primary Key: Partition key only
    • Composite Primary Key: Partition key + Sort key
  2. Data Types:

    • Scalar: Number, String, Binary, Boolean, Null
    • Document: List, Map
    • Set: String Set, Number Set, Binary Set
  3. Read Consistency:

    • Eventually Consistent: May not reflect most recent write (default, cheaper)
    • Strongly Consistent: Always reflects most recent write
    • ACID Transactions: Supports atomic, consistent, isolated, durable transactions
  4. Capacity Modes:

    • Provisioned: Specify read/write capacity units in advance
    • On-Demand: Pay-per-request, auto-scales instantly
  5. Service Limits:

    • Max item size: 400KB
    • Max partition throughput: 3000 RCU and 1000 WCU
    • Max table size: Unlimited
    • Max number of tables per region: 2500 (default, can be increased)
    • Max number of GSIs per table: 20
    • Max number of LSIs per table: 5
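The fundamentals above can be sketched as a CreateTable request. This is a minimal illustration, not an official example: the table name `Orders` and its key attributes are assumptions, and the dict maps directly onto the parameters of boto3's `create_table`.

```python
# Sketch: a CreateTable request for a composite primary key (partition + sort).
# "Orders", "customer_id", and "order_date" are illustrative names.
# To execute: boto3.client("dynamodb").create_table(**create_table_request)
create_table_request = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # sort key
    ],
    "AttributeDefinitions": [
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    # On-demand mode: no ProvisionedThroughput needed
    "BillingMode": "PAY_PER_REQUEST",
}
```

Only key attributes are declared up front; all other attributes are schemaless and vary per item.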

Capacity Planning and Performance

| Concept | Description | Formula/Example |
|---|---|---|
| Read Capacity Unit (RCU) | 1 strongly consistent read per second for items up to 4KB | 1 RCU = 1 strongly consistent read/sec for a 4KB item |
| Write Capacity Unit (WCU) | 1 write per second for items up to 1KB | 1 WCU = 1 write/sec for a 1KB item |
| Eventually Consistent Read | Consumes half the RCUs of strongly consistent reads | 0.5 RCU for a 4KB item |
| Transactional Read | Consumes 2x the RCUs of strongly consistent reads | 2 RCUs for a 4KB item |
| Transactional Write | Consumes 2x the WCUs of standard writes | 2 WCUs for a 1KB item |
  1. Capacity Calculation Examples:

    • Reading 10 items of 8KB each per second (strongly consistent): 10 × (8KB/4KB) = 20 RCUs
    • Writing 5 items of 2.5KB each per second: size rounds up to 3KB per item, so 5 × (3KB/1KB) = 15 WCUs
    • Reading 15 items of 6KB each per second (eventually consistent): size rounds up to 8KB per item, so 15 × (8KB/4KB) × 0.5 = 15 RCUs
  2. Adaptive Capacity: Automatically distributes workloads across partitions to handle hot keys and hot partitions.

  3. Burst Capacity: Unused capacity is stored for up to 5 minutes to handle sudden spikes.
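The capacity arithmetic above can be captured in a small helper. This is a sketch of the published RCU/WCU formulas: item size rounds up per item (to the next 4KB for reads, 1KB for writes) before multiplying by the request rate.

```python
import math

def rcus(item_kb: float, reads_per_sec: int, consistency: str = "strong") -> int:
    """Read capacity units: item size rounds up to the next 4KB per item."""
    units = math.ceil(item_kb / 4)          # 4KB read units per item
    factor = {"eventual": 0.5, "strong": 1, "transactional": 2}[consistency]
    return math.ceil(reads_per_sec * units * factor)

def wcus(item_kb: float, writes_per_sec: int, transactional: bool = False) -> int:
    """Write capacity units: item size rounds up to the next 1KB per item."""
    units = math.ceil(item_kb / 1)
    return writes_per_sec * units * (2 if transactional else 1)

print(rcus(8, 10))              # 20 RCUs  (example 1)
print(wcus(2.5, 5))             # 15 WCUs  (example 2: per-item rounding to 3KB)
print(rcus(6, 15, "eventual"))  # 15 RCUs  (example 3: per-item rounding to 8KB)
```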

Secondary Indexes

| Feature | Local Secondary Index (LSI) | Global Secondary Index (GSI) |
|---|---|---|
| Key Schema | Same partition key as base table, different sort key | Different partition key and/or sort key |
| Creation | Only at table creation time | Can be added/removed anytime |
| Size Limits | 10GB per partition key value, shared with base table | No size limit |
| Consistency | Supports both eventual and strong consistency | Only eventual consistency |
| Provisioning | Uses base table's capacity | Has its own provisioned capacity |
| Projection Types | ALL, KEYS_ONLY, INCLUDE | ALL, KEYS_ONLY, INCLUDE |
  1. Projection Types:

    • ALL: All attributes from the base table
    • KEYS_ONLY: Only the base table's primary key and the index key attributes
    • INCLUDE: Only the specified attributes
  2. Sparse Indexes: Indexes that only contain items with the specified attributes, useful for queries on a subset of data.
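Unlike an LSI, a GSI can be added after the table exists, as the comparison table notes. A minimal sketch of such an UpdateTable request (index and attribute names are assumptions):

```python
# Sketch: an UpdateTable request adding a GSI to an existing on-demand table.
# "status-index" and "status" are illustrative names.
# To execute: boto3.client("dynamodb").update_table(**add_gsi_request)
add_gsi_request = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "status", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "status-index",
                "KeySchema": [{"AttributeName": "status", "KeyType": "HASH"}],
                # KEYS_ONLY keeps the index small; use INCLUDE or ALL as needed
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        }
    ],
}
```

Because only items that actually have a `status` attribute appear in the index, this is also how a sparse index is built: omit the attribute on items you want excluded.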

Data Management and Operations

  1. DynamoDB Streams: Captures item-level modifications in a DynamoDB table and sends to Lambda, Kinesis, or other services.

  2. Time To Live (TTL): Automatically deletes items after a specified timestamp.

  3. Point-in-Time Recovery (PITR): Continuous backups for the last 35 days.

  4. On-Demand Backup: Full backups for long-term retention.

  5. Global Tables: Multi-region, multi-active replication for global applications.

  6. DAX (DynamoDB Accelerator): In-memory cache for DynamoDB, reduces read latency from milliseconds to microseconds.

  7. Encryption: All data is encrypted at rest by default using AWS KMS.
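TTL and PITR from the list above are both single API calls to enable. A sketch of the two requests (table and attribute names are assumptions; they map to boto3's `update_time_to_live` and `update_continuous_backups`):

```python
# Sketch: enable TTL on an epoch-seconds Number attribute, and turn on
# point-in-time recovery. "Sessions" and "expires_at" are illustrative.
ttl_request = {
    "TableName": "Sessions",
    "TimeToLiveSpecification": {
        "Enabled": True,
        "AttributeName": "expires_at",  # Unix epoch seconds, Number type
    },
}
pitr_request = {
    "TableName": "Sessions",
    "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
}
```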

Data Modeling and Access Patterns

  1. Single-Table Design: Store multiple entity types in one table to minimize queries.

  2. Overloading Keys: Using the same attribute for different entity types.

  3. Composite Sort Keys: Combining multiple attributes in the sort key for hierarchical data.

  4. GSI Overloading: Using GSIs for different access patterns on the same table.

  5. Sparse Indexes: Only items with the indexed attribute appear in the index.

  6. Sort Key Patterns:

    • Hierarchical data: department#engineering#project#dynamo
    • Time-based data: 2023-05-15#12:30:45
    • Version control: v1.0.2#2023-05-15
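The hierarchical sort-key pattern above pays off at query time: `begins_with` on the sort key retrieves an entire subtree in one request. A sketch of such a Query request (key names and values are assumptions):

```python
# Sketch: querying a hierarchical composite sort key with begins_with.
# Returns every item under the "engineering" department for one partition.
# To execute: boto3.client("dynamodb").query(**query_request)
query_request = {
    "TableName": "OrgData",
    "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
    "ExpressionAttributeValues": {
        ":pk": {"S": "org#acme"},
        ":prefix": {"S": "department#engineering#"},
    },
}
```

Narrowing the prefix (e.g. `department#engineering#project#dynamo`) drills deeper into the hierarchy without any schema change.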

API Operations

| Operation Category | Key Operations | Description |
|---|---|---|
| Control Plane | CreateTable, UpdateTable, DeleteTable | Manage table structure and settings |
| Data Plane (Item) | PutItem, GetItem, UpdateItem, DeleteItem | Single-item operations |
| Data Plane (Batch) | BatchGetItem, BatchWriteItem | Multiple-item operations (up to 100 items per BatchGetItem, 25 per BatchWriteItem) |
| Data Plane (Query) | Query | Find items with the same partition key, filter by sort key |
| Data Plane (Scan) | Scan | Examine every item in a table (expensive) |
| Transactions | TransactWriteItems, TransactGetItems | ACID transactions across multiple items |
  1. PartiQL: SQL-compatible query language for DynamoDB.

  2. Filter Expressions: Applied server-side after items are read but before results are returned; read capacity is still consumed for the filtered-out items.

  3. Projection Expressions: Specify which attributes to return.

  4. Condition Expressions: Only perform writes if conditions are met.

  5. Expression Attribute Names/Values: Placeholders for attribute names and values in expressions.
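Items 4 and 5 above combine naturally in a single UpdateItem request. A sketch (table, key, and attribute names are assumptions): the `#st` placeholder works around `status` being a DynamoDB reserved word, and the condition on `version` implements optimistic locking.

```python
# Sketch: UpdateItem with a condition expression plus expression attribute
# names/values. The write fails with ConditionalCheckFailedException if
# another writer has already bumped the version.
# To execute: boto3.client("dynamodb").update_item(**update_request)
update_request = {
    "TableName": "Orders",
    "Key": {"customer_id": {"S": "c#42"}, "order_date": {"S": "2023-05-15"}},
    "UpdateExpression": "SET #st = :new_status, version = :next",
    "ConditionExpression": "version = :expected",
    "ExpressionAttributeNames": {"#st": "status"},  # "status" is reserved
    "ExpressionAttributeValues": {
        ":new_status": {"S": "SHIPPED"},
        ":next": {"N": "3"},
        ":expected": {"N": "2"},
    },
}
```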

Performance Optimization

  1. Partition Key Design: Choose high-cardinality attributes to distribute data evenly.

  2. Avoid Hot Partitions: Distribute workload evenly across partition keys.

  3. Use Batch Operations: BatchGetItem and BatchWriteItem for efficiency.

  4. Parallel Scans: Split large tables into segments for faster scanning.

  5. Sparse Indexes: Create indexes only on frequently queried attributes.

  6. Attribute Projections: Only project needed attributes in secondary indexes.

  7. Query Instead of Scan: Always prefer Query over Scan operations.

  8. Page Size Optimization: Use Limit parameter and pagination for large result sets.

  9. DAX Caching: Use DAX for read-heavy workloads to reduce latency.
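Point 8 (pagination with `Limit` and `LastEvaluatedKey`) can be sketched as a small generator. The `query_fn` parameter is an assumption standing in for boto3's `client.query`; any callable returning the documented `{"Items": [...], "LastEvaluatedKey": ...}` shape works, which also makes the loop testable offline.

```python
# Sketch: page through Query/Scan results by feeding LastEvaluatedKey
# back in as ExclusiveStartKey until the server stops returning one.
def paginate(query_fn, **request):
    while True:
        page = query_fn(**request)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return                               # no more pages
        request["ExclusiveStartKey"] = last_key  # resume after last item
```

Usage would look like `paginate(client.query, TableName="Orders", Limit=100, ...)`, keeping each response small while still streaming the full result set.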

Cost Optimization

  1. Reserved Capacity: Purchase reserved capacity for predictable workloads (up to 72% savings).

  2. Auto Scaling: Configure to adjust capacity based on utilization.

  3. On-Demand Mode: For unpredictable workloads or development environments.

  4. Compression: Compress large attribute values before storing.

  5. TTL: Use TTL to automatically remove unnecessary data.

  6. Monitor CloudWatch Metrics: Track consumed capacity to optimize provisioning.

  7. Sparse GSIs: Minimize the number of items in GSIs to reduce costs.

Integration with Other AWS Services

  1. Lambda: Triggers for DynamoDB Streams, processing data changes.

  2. Kinesis: Stream DynamoDB data changes to Kinesis for real-time analytics.

  3. S3: Export/import data between DynamoDB and S3.

  4. Glue: ETL jobs for DynamoDB data.

  5. Athena: Query exported DynamoDB data in S3 using SQL.

  6. EMR: Process DynamoDB data using Hadoop ecosystem.

  7. AppSync: GraphQL interface for DynamoDB.

  8. Step Functions: Orchestrate workflows involving DynamoDB operations.

Data Ingestion and Processing

  1. Bulk Loading: Use AWS Data Pipeline or custom solutions with BatchWriteItem.

  2. Kinesis Data Streams: Buffer incoming data and write it to DynamoDB via a Lambda or custom consumer (Kinesis Data Firehose does not support DynamoDB as a direct delivery destination).

  3. DynamoDB Streams with Lambda: Process and transform data as it's modified.

  4. Write Sharding: Distribute writes across multiple partition keys.

  5. Throttling Handling: Implement exponential backoff and jitter for retries.

  6. Rate Limiting: Use client-side throttling to prevent exceeding provisioned capacity.

  7. Replayability: Use DynamoDB Streams with Lambda to enable replay of data processing.
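Write sharding (point 4) can be sketched as a pair of helpers. The shard count and key format are assumptions: appending a deterministic suffix spreads a hot key's writes across N partitions, at the cost of readers having to query all N suffixes and merge.

```python
import hashlib

NUM_SHARDS = 10  # illustrative; size to your write throughput

def sharded_key(base_key: str, item_id: str) -> str:
    """Derive a stable shard suffix from the item id, so the same item
    always lands on the same shard."""
    shard = int(hashlib.md5(item_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{base_key}#{shard}"

def all_shard_keys(base_key: str) -> list[str]:
    """Every shard key a reader must query to see the full data set."""
    return [f"{base_key}#{s}" for s in range(NUM_SHARDS)]
```

Hashing on the item id (rather than random suffixes) keeps lookups for a known item to a single shard.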

DynamoDB Transactions

  1. TransactWriteItems: Atomic writes across multiple items and tables.

  2. TransactGetItems: Atomic reads across multiple items and tables.

  3. Idempotency: Use client tokens to make transactions idempotent.

  4. Isolation Levels: Provides serializable isolation.

  5. Capacity Consumption: Transactions consume 2x the normal capacity units.
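Points 1 and 3 above combine in a single request. A sketch of a TransactWriteItems payload (table, key, and attribute names are assumptions): the `ClientRequestToken` makes the transaction idempotent for 10 minutes, so a retried call does not apply the writes twice.

```python
import uuid

# Sketch: atomically create an order and bump the customer's order count.
# To execute: boto3.client("dynamodb").transact_write_items(**transact_request)
transact_request = {
    "ClientRequestToken": str(uuid.uuid4()),  # idempotency key
    "TransactItems": [
        {"Put": {
            "TableName": "Orders",
            "Item": {"customer_id": {"S": "c#42"}, "order_date": {"S": "2023-05-15"}},
        }},
        {"Update": {
            "TableName": "Customers",
            "Key": {"customer_id": {"S": "c#42"}},
            "UpdateExpression": "ADD order_count :one",
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }},
    ],
}
```

Either both writes succeed or neither does, consuming 2x the capacity of the equivalent standard writes.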

DynamoDB Streams and Change Data Capture

  1. Stream Records: Contains before and after images of modified items.

  2. Stream View Types:

    • KEYS_ONLY: Only key attributes
    • NEW_IMAGE: The entire item after modification
    • OLD_IMAGE: The entire item before modification
    • NEW_AND_OLD_IMAGES: Both before and after images
  3. Kinesis Adapter: Process DynamoDB Streams using Kinesis Client Library.

  4. Lambda Triggers: Automatically invoke Lambda functions on stream events.

  5. Stream Retention: Data is stored for 24 hours.

  6. Change Data Capture (CDC): Use Streams for CDC patterns to other systems.
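A Lambda trigger on a stream configured with NEW_AND_OLD_IMAGES receives records shaped like the sketch below. The record shape follows the documented stream event format; the flattening logic itself is an illustrative assumption.

```python
# Sketch: a Lambda handler for a DynamoDB Streams event source.
# OldImage is absent for INSERT events; NewImage is absent for REMOVE.
def handler(event, context=None):
    changes = []
    for record in event["Records"]:
        ddb = record["dynamodb"]
        changes.append({
            "kind": record["eventName"],   # INSERT | MODIFY | REMOVE
            "keys": ddb["Keys"],
            "old": ddb.get("OldImage"),
            "new": ddb.get("NewImage"),
        })
    return changes
```

Because the same batch may be redelivered on failure, downstream processing should stay idempotent, as the replayability section below discusses.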

DynamoDB Features Comparison

| Feature | Standard Tables | Global Tables |
|---|---|---|
| Replication | Single region | Multi-region, multi-active |
| Latency | Region-specific | Local reads in each region |
| Failover | Manual | Automatic |
| Consistency | Both eventual and strong | Eventually consistent across regions |
| Use Case | Regional applications | Global applications |

| Feature | Provisioned Capacity | On-Demand Capacity |
|---|---|---|
| Cost Model | Pay for provisioned capacity | Pay per request |
| Scaling | Auto scaling or manual | Automatic |
| Predictability | Predictable costs | Costs vary with traffic |
| Burst Handling | Limited burst capacity | Scales instantly with traffic |
| Use Case | Predictable workloads | Variable/unpredictable workloads |

DynamoDB Service Features Summary

| Feature | Description | Limits/Notes |
|---|---|---|
| Tables | Collection of items | Unlimited size; 2,500 tables per region (default) |
| Items | Collection of attributes | 400KB max size per item |
| Primary Key | Unique identifier | Simple or composite |
| Secondary Indexes | Alternative access patterns | 20 GSIs, 5 LSIs per table |
| Capacity Units | Throughput measurement | RCU (4KB reads), WCU (1KB writes) |
| Consistency Models | Data retrieval consistency | Eventually or strongly consistent |
| Transactions | ACID operations | 2x capacity consumption |
| Streams | Change data capture | 24-hour retention |
| TTL | Automatic item expiration | Based on timestamp attribute |
| Backups | Data protection | On-demand and continuous (PITR) |
| Global Tables | Multi-region replication | Active-active configuration |
| DAX | In-memory cache | Microsecond latency |
| Encryption | Data protection | Server-side encryption by default |
| Auto Scaling | Automatic capacity adjustment | Target utilization percentage |
| PartiQL | SQL-compatible queries | SQL-like access to NoSQL data |

Important CloudWatch Metrics for Monitoring

  1. ConsumedReadCapacityUnits: The number of read capacity units consumed.

  2. ConsumedWriteCapacityUnits: The number of write capacity units consumed.

  3. ProvisionedReadCapacityUnits: The number of provisioned read capacity units.

  4. ProvisionedWriteCapacityUnits: The number of provisioned write capacity units.

  5. ReadThrottleEvents: Requests to DynamoDB that exceed the provisioned read capacity units.

  6. WriteThrottleEvents: Requests to DynamoDB that exceed the provisioned write capacity units.

  7. SuccessfulRequestLatency: The latency of successful requests to DynamoDB.

  8. ThrottledRequests: The number of throttled requests.

  9. SystemErrors: The number of requests that generated an error due to system issues.

  10. UserErrors: The number of requests that generated an error due to user issues.
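Throttle metrics (5, 6, and 8 above) are the usual first alarms to configure. A sketch of a PutMetricAlarm request (alarm name, table name, and threshold are assumptions; it maps to the CloudWatch client's `put_metric_alarm`):

```python
# Sketch: alarm when any reads are throttled on a table within a minute.
# To execute: boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
alarm_request = {
    "AlarmName": "orders-read-throttles",
    "Namespace": "AWS/DynamoDB",
    "MetricName": "ReadThrottleEvents",
    "Dimensions": [{"Name": "TableName", "Value": "Orders"}],
    "Statistic": "Sum",
    "Period": 60,                # seconds per evaluation window
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}
```

A matching alarm on WriteThrottleEvents covers the write side.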

DynamoDB Mind Map

```
DynamoDB
├── Core Components
│   ├── Tables
│   ├── Items
│   ├── Attributes
│   ├── Primary Keys
│   │   ├── Partition Key
│   │   └── Sort Key
│   └── Secondary Indexes
│       ├── Local Secondary Indexes (LSI)
│       └── Global Secondary Indexes (GSI)
├── Capacity Management
│   ├── Provisioned Mode
│   │   ├── Read Capacity Units (RCU)
│   │   ├── Write Capacity Units (WCU)
│   │   └── Auto Scaling
│   └── On-Demand Mode
├── Data Access
│   ├── Single-Item Operations
│   │   ├── GetItem
│   │   ├── PutItem
│   │   ├── UpdateItem
│   │   └── DeleteItem
│   ├── Multi-Item Operations
│   │   ├── Query
│   │   ├── Scan
│   │   ├── BatchGetItem
│   │   └── BatchWriteItem
│   └── Transactions
│       ├── TransactWriteItems
│       └── TransactGetItems
├── Advanced Features
│   ├── DynamoDB Streams
│   ├── Global Tables
│   ├── Time To Live (TTL)
│   ├── Point-in-Time Recovery
│   └── DAX (DynamoDB Accelerator)
├── Data Modeling
│   ├── Single-Table Design
│   ├── Access Patterns
│   ├── Composite Keys
│   └── Sparse Indexes
└── Integration
    ├── Lambda
    ├── Kinesis
    ├── S3
    ├── Glue
    └── AppSync
```

Throttling and Rate Limits

  1. Throttling: Occurs when requests exceed provisioned capacity.

  2. Exponential Backoff: Retry failed operations with increasing wait times.

  3. Jitter: Add randomness to retry intervals to prevent thundering herd problems.

  4. Request Rate Limiting: Client-side throttling to stay within limits.

  5. Burst Capacity: Temporarily exceeding provisioned capacity (up to 5 minutes of unused capacity).

  6. Adaptive Capacity: Automatically redistributes capacity to handle hot partitions.

  7. Error Handling: Implement proper handling for ProvisionedThroughputExceededException.

  8. Monitoring: Set up CloudWatch alarms for throttling events.

  9. Auto Scaling: Configure to adjust capacity based on consumption patterns.

  10. On-Demand Mode: Switch to on-demand for unpredictable workloads to avoid throttling.
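Points 2 and 3 above (exponential backoff with jitter) can be sketched as a delay function. The base and cap values are illustrative assumptions; note that boto3 also ships built-in retry handling via `botocore.config.Config(retries={"mode": "adaptive"})`, so hand-rolled backoff is mainly for custom clients.

```python
import random

def backoff_delay(attempt: int, base: float = 0.05, cap: float = 20.0) -> float:
    """'Full jitter' backoff: a random delay in [0, min(cap, base * 2^attempt)].

    Use before retry number `attempt` (0-based) after a throttled request,
    e.g. on ProvisionedThroughputExceededException. The randomness prevents
    a thundering herd of synchronized retries.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```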

Throughput and Latency Characteristics

  1. Read Latency: Single-digit milliseconds for standard operations.

  2. Write Latency: Single-digit milliseconds for standard operations.

  3. DAX Read Latency: Microseconds for cached reads.

  4. Global Tables Latency: Local reads and writes in each region.

  5. Scan Operation Latency: Proportional to table size and item count.

  6. Query Operation Latency: Proportional to result set size.

  7. Batch Operation Throughput: Up to 25 items per BatchWriteItem and 100 items per BatchGetItem.

  8. Partition Throughput Limits: 3000 RCU and 1000 WCU per partition.

  9. Transactional Operations: 2x standard latency due to two-phase commit protocol.

  10. Strongly Consistent Reads: Higher latency than eventually consistent reads.

Replayability of Data Ingestion Pipelines

  1. DynamoDB Streams: Capture changes for 24 hours, enabling replay of recent changes.

  2. Lambda Event Source Mapping: Tracks position in stream, can restart from specific sequence number.

  3. Checkpointing: Store processing position in a separate DynamoDB table.

  4. Dead Letter Queues: Capture failed processing attempts for later replay.

  5. S3 Export/Import: Export data to S3 for long-term storage and replay capability.

  6. Kinesis Integration: Use Kinesis as an intermediate buffer for enhanced replay capabilities.

  7. Idempotent Processing: Design processors to handle duplicate events safely.

  8. Event Sourcing Pattern: Store all events to enable complete system state reconstruction.

  9. Change Data Capture (CDC): Use DynamoDB Streams as a CDC source for other systems.

  10. ETL Orchestration: Use Step Functions to coordinate and retry data processing workflows.
