Intro
On Prime Day 2021, Amazon's systems made trillions of API calls to DynamoDB, peaking at 89.2M requests per second (RPS), all while maintaining single-digit millisecond latency. Yet many developers still struggle with the very database scaling issues DynamoDB was created to solve.
Having worked extensively with DynamoDB at Zomato, I have hit the same roadblocks: hot partitions throttling applications, costs spiraling due to over-provisioning, and data models that worked perfectly in other databases suddenly becoming performance bottlenecks. These challenges often stem from approaching DynamoDB with a traditional database mindset.
In this article, I'll dive into DynamoDB's architecture from a system design perspective. We'll explore the fundamentals of DynamoDB, how its approach to scaling and performance management differs from traditional databases, and, more importantly, how to architect your applications to harness its full power. Whether you're debugging performance issues, optimizing costs, or just starting with DynamoDB, understanding these internals will help you make better design decisions.
Understanding DynamoDB's core principles
At the heart of DynamoDB's robustness and scalability lie six core principles around which it was architected. These principles prioritize simplicity, predictability, and reliability. Let's explore each, starting with perhaps the most counterintuitive one.
1. Predictability over peak performance
This sounds counterintuitive from a software perspective. Shouldn't one chase maximum throughput? While most databases do exactly that, DynamoDB takes a contrarian stance - predictable performance matters more than peak performance.
If you marinate on this, it does make a lot of sense from a database perspective. What good is a lightning-fast database that occasionally stalls for hundreds of milliseconds - or even several seconds? Those latency spikes will ultimately affect application performance. DynamoDB's architects recognized this early on and set an ambitious goal - complete all requests with sub-10ms latency.
How does it achieve this? Through some clever architectural decisions:
Global Admission Control to manage capacity across entire tables, preventing any single partition from becoming a bottleneck
Automatic partitioning to distribute I/O load evenly, even as access patterns change
Proactive monitoring and rebalancing to prevent resource exhaustion before it impacts performance
Multi-tenant architecture with strict resource isolation to ensure consistent performance even under varied workloads
The result - Once properly configured, applications can rely on consistent performance whether they're handling normal traffic or sudden spikes. No more over-provisioning "just in case" or late-night panic when traffic patterns change.
2. Multi-tenant Architecture with Isolation
Unlike traditional databases where you might have dedicated instances, DynamoDB takes a different approach - it runs multiple customers' workloads on the same physical machines.
Now, you might be thinking: doesn't this introduce the noisy neighbour problem? This is where DynamoDB's clever resource management comes in:
Resource Reservation - Each table gets guaranteed capacity through token bucket allocation. When you provision throughput or use on-demand tables, DynamoDB reserves corresponding resources, ensuring your workload gets its fair share regardless of other tenants
Strict Workload Isolation - Storage nodes host replicas from different tables and customers, but use partition-level token buckets to enforce throughput limits. Even if one tenant's workload spikes, it can't consume more than its allocated capacity, protecting other tenants' performance.
Dynamic Load Balancing - DynamoDB's autoadmin service continuously monitors storage node utilization. If a node's throughput exceeds a threshold percentage of its maximum capacity, DynamoDB automatically redistributes partitions to maintain balanced load across the fleet.
Proactive Capacity Management - Rather than waiting for resource exhaustion, DynamoDB tracks consumed capacity trends and automatically rebalances partitions before performance is impacted. Storage nodes independently report when they're approaching capacity limits.
These mechanisms ensure workload isolation between co-resident tables while maintaining high resource utilization. The benefit is cost savings that get passed on to customers without giving up performance guarantees - the token bucket sketch below illustrates the core idea.
The autoadmin service is responsible for fleet health, partition health, scaling of tables, and execution of all control plane requests.
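To make the token bucket mechanism concrete, here's a minimal Python sketch of partition-level admission control. This is an illustration of the general technique, not DynamoDB's actual implementation - the refill rate and burst numbers are made-up assumptions.

```python
import time

class TokenBucket:
    """Simplified partition-level token bucket: each request consumes tokens,
    and tokens refill at the partition's allocated rate."""

    def __init__(self, refill_rate_per_sec, burst_capacity):
        self.refill_rate = refill_rate_per_sec   # allocated throughput
        self.capacity = burst_capacity           # allows short bursts
        self.tokens = burst_capacity
        self.last_refill = time.monotonic()

    def try_consume(self, tokens_needed=1):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True        # admit the request
        return False           # throttle: this tenant exceeded its allocation

# A spiking tenant is throttled once its bucket drains, leaving the storage
# node's remaining capacity for the other tenants on the same machine.
bucket = TokenBucket(refill_rate_per_sec=100, burst_capacity=300)
admitted = sum(bucket.try_consume() for _ in range(1000))
print(f"admitted {admitted} of 1000 burst requests")
```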
3. Boundless Scale for Tables
DynamoDB takes another contrarian approach with this principle. In traditional databases, we often have to carefully plan capacity and manage storage thresholds. Unlike these databases, DynamoDB imposes no predefined limits on the amount of data each table can store. Tables grow elastically to meet the demands of customers' applications.
DynamoDB is designed to scale the resources dedicated to a table from several servers to many thousands as needed. How does DynamoDB achieve this seemingly infinite scalability? The secret lies in its partitioning strategy.
When you create a table, DynamoDB splits it into multiple partitions. Each partition is a dedicated slice of storage and compute responsible for a specific range of keys. These ranges are organized so that no two partitions overlap (disjoint ranges), and each partition handles all keys within its assigned range (contiguous ranges).
For example, if your table uses numeric keys from 1-100, you might have:
Partition A: Keys 1-35
Partition B: Keys 36-70
Partition C: Keys 71-100
Each key range is handled by exactly one partition, with no gaps or overlaps between partitions. As your table grows in size or throughput demands increase, DynamoDB automatically splits these partitions when:
Data Size: When a partition's stored data exceeds its size threshold. This ensures no single partition grows too large to manage.
Throughput: When a partition's consumed capacity consistently exceeds partition limits. DynamoDB monitors actual usage patterns and splits partitions based on observed traffic, not just size.
This automatic splitting enables tables to scale horizontally as they grow. Each new partition adds more dedicated storage and compute capacity to your table, all managed transparently by DynamoDB.
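To illustrate the routing and splitting described above, here's a small Python sketch that maps keys onto disjoint, contiguous ranges and splits a partition when it grows past a threshold. It's a conceptual toy - DynamoDB actually hashes the partition key and splits on hash-range and usage boundaries - and the names and thresholds are assumptions.

```python
import bisect

class Partition:
    def __init__(self, low, high):
        self.low, self.high = low, high   # inclusive key range [low, high]
        self.items = {}

# Disjoint, contiguous ranges covering keys 1-100, as in the example above
partitions = [Partition(1, 35), Partition(36, 70), Partition(71, 100)]
boundaries = [p.high for p in partitions]   # sorted upper bounds of each range

def route(key):
    """Find the single partition whose range contains the key."""
    return partitions[bisect.bisect_left(boundaries, key)]

def put(key, value, split_threshold=1000):
    p = route(key)
    p.items[key] = value
    # Split when the partition exceeds its threshold: each half becomes a new
    # partition with its own dedicated storage and compute.
    if len(p.items) > split_threshold:
        mid = (p.low + p.high) // 2
        left, right = Partition(p.low, mid), Partition(mid + 1, p.high)
        for k, v in p.items.items():
            (left if k <= mid else right).items[k] = v
        i = partitions.index(p)
        partitions[i:i + 1] = [left, right]
        boundaries[:] = [q.high for q in partitions]

put(42, {"name": "example"})
print(route(42).low, route(42).high)   # -> 36 70
```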
4. High Availability
In today's world, even a few minutes of database downtime can severely impact businesses. DynamoDB is architected for high availability, offering a 99.99% SLA for regular tables and an even higher 99.999% SLA for global tables.
At its core, DynamoDB achieves this through a sophisticated replication strategy. Every partition of your table is replicated across multiple Availability Zones (AZs) - physically separate data centers in the same region. But replication alone isn't enough - it's how DynamoDB manages these replicas that makes the difference.
Each partition has multiple replicas distributed across different AZs, forming what's called a replication group. Within each group, one replica acts as the leader, coordinating writes and strongly consistent reads.
Here's how it works:
For writes
When a write request arrives, the leader generates a WAL (write ahead log) record
This record must be persisted by a quorum of replicas before acknowledging the write
Even if one replica fails, as long as the quorum exists, writes continue uninterrupted
For reads
Strongly consistent reads are served by the leader
Eventually consistent reads can be served by any replica
This flexibility allows DynamoDB to balance load and maintain availability even during failures; the sketch below walks through the write path.
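To make the quorum write concrete, here's a simplified sketch: the leader sends the write-ahead log record to its peers and acknowledges the write only once a quorum has persisted it. This is a toy model of the general technique, not DynamoDB's replication protocol, and the replica names are made up.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.log = name, healthy, []

    def persist(self, record):
        if not self.healthy:
            return False          # simulate a failed or unreachable replica
        self.log.append(record)   # durably append the write-ahead log record
        return True

def replicated_write(replicas, record, quorum):
    """Leader-driven write: acknowledge once a quorum has persisted the record."""
    acks = sum(r.persist(record) for r in replicas)
    return acks >= quorum

# Three replicas across AZs; with one down, a quorum of 2 still succeeds.
replicas = [Replica("az1-leader"), Replica("az2"), Replica("az3", healthy=False)]
ok = replicated_write(replicas, {"key": "user#42", "op": "PUT"}, quorum=2)
print("write acknowledged:", ok)   # -> True
```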
When things do go wrong, DynamoDB has multiple layers of failure protection:
Fast recovery - If a replica becomes unhealthy, DynamoDB quickly adds a "log replica", which is a lightweight replica that only stores recent write logs. This ensures durability while a full replica is being rebuilt.
Smart failure detection - Before triggering failovers, replicas check with their peers to confirm if a leader is truly unavailable, preventing disruptions from temporary network issues
Automatic healing - DynamoDB's autoadmin service continuously monitors partition health and automatically replaces unhealthy replicas, ensuring your data remains triple-replicated
DynamoDB maintains this level of availability without requiring any operational overhead from developers. You don't need to configure replication, manage failovers, or worry about cross-AZ communication - it's all handled automatically.
5. Fully Managed Cloud Service
DynamoDB takes another fundamentally different approach to database management - it's fully managed, meaning developers never have to worry about the underlying infrastructure. This principle goes beyond simple automation; it's baked into DynamoDB's core design:
Resource Management
- Developers interact purely through the DynamoDB API to create tables and perform operations (a brief example appears at the end of this section)
- No need to think about server provisioning, storage allocation, or cluster configuration
- The service automatically handles resource scaling based on your needs
Operational Tasks
- All software patching and upgrades happen transparently
- Hardware failures are automatically detected and remediated
- Backups are managed systematically without impacting performance
- Data encryption is handled automatically at rest and in transit
Infrastructure Management
- No need to configure distributed database clusters
- Automatic recovery from failures without manual intervention
- Built-in security features like encryption and access control
- Continuous monitoring and maintenance without downtime
The beauty of this approach is that developers can focus entirely on their application logic rather than database administration. Whether you're running a small application or handling millions of requests per second, the operational complexity remains zero.
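As an example of that API-only interaction, here's roughly what creating a table looks like with boto3. The table and attribute names, and the choice of on-demand billing, are just illustrative - the point is that there are no servers, volumes, or clusters to configure.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# One API call: partitioning, replication, and scaling happen behind it.
dynamodb.create_table(
    TableName="orders",                                       # hypothetical name
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},  # partition key
        {"AttributeName": "order_id", "KeyType": "RANGE"},    # sort key
    ],
    BillingMode="PAY_PER_REQUEST",   # on-demand: no capacity planning up front
)
```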
6. Flexible Use Cases
While DynamoDB is often categorized as a NoSQL database, it's designed with flexibility in mind, accommodating various data models and consistency requirements. This flexibility manifests in several key ways:
Data Model Flexibility - DynamoDB doesn't force developers into a particular data model or consistency model. DynamoDB tables don't have a fixed schema but instead allow each data item to contain any number of attributes with varying types, including multi-valued attributes. Tables use a key-value or document data model.
Consistency Options - Developers can choose between eventually consistent and strongly consistent reads, which allows applications to balance performance against consistency needs. You could use strong consistency for critical operations and eventual consistency for better performance when absolute freshness isn't required (see the sketch below).
Query Patterns - DynamoDB supports both single item operations and range queries, with secondary indexes for flexible access patterns and global tables for low latency multi region access. It also supports ACID transactions across items without compromising the scalability, availability, and performance characteristics of DynamoDB tables.
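To show the consistency choice in practice, here's a hedged boto3 sketch: the same get_item call with ConsistentRead toggled per request. The table and key names are hypothetical.

```python
import boto3

table = boto3.resource("dynamodb").Table("orders")   # hypothetical table

# Critical path: read-your-writes matters, so pay for a strongly consistent
# read (served by the leader replica).
order = table.get_item(
    Key={"customer_id": "C123", "order_id": "O456"},
    ConsistentRead=True,
).get("Item")

# Display path: eventual consistency is fine and can be served by any replica.
order_preview = table.get_item(
    Key={"customer_id": "C123", "order_id": "O456"},
    ConsistentRead=False,   # the default
).get("Item")
```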
Common Pitfalls
While DynamoDB's architecture solves many scaling challenges, teams often hit roadblocks when approaching it with a traditional database mindset. Here are common pitfalls to avoid:
Poor Partition Key Design
- Using sequential IDs as partition keys, causing write hotspots
- Using too few unique partition keys, leading to throttling
- Using highly correlated values (like dates) that create temporal hotspots
Better Approach: Choose partition keys with high cardinality and evenly distributed access. Consider composite keys or adding a random suffix to spread writes (see the sketch below)
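One common way to apply the random-suffix advice is write sharding: spread a hot logical key across N physical partition keys on write, and fan reads out across the same N keys. A minimal sketch, with the shard count and key format as assumptions:

```python
import random

NUM_SHARDS = 10   # assumption: tune to your write throughput

def sharded_partition_key(logical_key):
    """Spread writes for a hot key (e.g. today's date) across NUM_SHARDS keys."""
    return f"{logical_key}#{random.randint(0, NUM_SHARDS - 1)}"

def all_shard_keys(logical_key):
    """Reads must fan out across every shard and merge the results."""
    return [f"{logical_key}#{i}" for i in range(NUM_SHARDS)]

# Writes land on e.g. "2021-06-21#7" instead of all hitting "2021-06-21"
print(sharded_partition_key("2021-06-21"))
print(all_shard_keys("2021-06-21"))
```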
Inefficient Access Patterns
- Creating too many indexes, increasing write costs and complexity
- Not aligning schema with query patterns
- Better Approach: Design your schema around your access patterns. Start with your queries and work backwards to your data model.
Over-Relying on Secondary Indexes
- Creating an index for every possible query pattern
- Not considering the write amplification cost
- Better Approach: Use sparse indexes and carefully evaluate the cost-benefit of each index
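A sparse index works because DynamoDB only includes items that actually contain the index's key attribute. One way to use this - the table and attribute names here are hypothetical - is to set the attribute only on the subset of items you need to query, and remove it when you're done.

```python
import boto3

table = boto3.resource("dynamodb").Table("orders")   # hypothetical table

# Only unshipped orders carry "pending_since", so a GSI keyed on it stays
# small and cheap: shipped orders never appear in the index at all.
table.put_item(Item={
    "customer_id": "C123",
    "order_id": "O456",
    "status": "PLACED",
    "pending_since": "2021-06-21",   # present only while the order is pending
})

# When the order ships, remove the attribute and it drops out of the sparse GSI.
table.update_item(
    Key={"customer_id": "C123", "order_id": "O456"},
    UpdateExpression="SET #s = :shipped REMOVE pending_since",
    ExpressionAttributeNames={"#s": "status"},      # "status" is a reserved word
    ExpressionAttributeValues={":shipped": "SHIPPED"},
)
```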
Inappropriate Item Size
Storing items that approach the 400KB item size limit
Not considering how item size grows over time
- Better Approach: Consider chunking strategies for large items, and plan for data growth.
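One chunking strategy for items that risk hitting the limit is to split the payload across multiple items under the same partition key, using the sort key to keep the chunks ordered. A sketch with assumed key names and chunk size (very large or rarely accessed payloads are often better off in S3):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("documents")   # hypothetical table
CHUNK_SIZE = 350_000   # bytes, comfortably under the 400KB item limit

def put_large_item(doc_id, payload: bytes):
    """Split a large payload into ordered chunk items under one partition key."""
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    for n, chunk in enumerate(chunks):
        table.put_item(Item={
            "doc_id": doc_id,           # partition key
            "sk": f"CHUNK#{n:04d}",     # sort key keeps chunks in order
            "total_chunks": len(chunks),
            "data": chunk,
        })

def get_large_item(doc_id) -> bytes:
    """A single range query on the partition key reassembles the chunks."""
    resp = table.query(
        KeyConditionExpression=Key("doc_id").eq(doc_id)
        & Key("sk").begins_with("CHUNK#")
    )
    return b"".join(item["data"].value for item in resp["Items"])
```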
Having faced these challenges at scale, I'll share our journey of overcoming them in an upcoming article.
Cost Optimization Strategies
One of the most painful lessons teams learn with DynamoDB is how quickly costs can spiral without proper planning. Let's dive into strategies that can help you optimize costs while maintaining performance.
Choosing Your Capacity Mode
DynamoDB offers two capacity modes, each with its own trade-offs.
Provisioned capacity
Provisioned capacity is best for predictable workloads and lowers costs when properly sized, but it requires careful capacity planning; consider enabling auto scaling to handle traffic variations (a configuration sketch follows below).
On-demand capacity
On-demand capacity is perfect for unpredictable workloads or development environments where traffic patterns are still unclear. While it might cost more for steady-state workloads, the operational simplicity and peace of mind can be worth the premium.
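For the provisioned mode above, auto scaling is configured through the Application Auto Scaling API rather than on the table itself. Here's a sketch of target-tracking on read capacity - the table name, bounds, and target utilization are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Let read capacity float between 5 and 100 RCUs for this table.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/orders",                       # hypothetical table
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=100,
)

# Scale so consumed capacity stays near 70% of what's provisioned.
autoscaling.put_scaling_policy(
    PolicyName="orders-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/orders",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)
```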
Smart Capacity Planning
Success with DynamoDB requires a data-driven approach to capacity planning. Start by monitoring your CloudWatch metrics to understand your usage patterns. Look for daily and weekly trends, seasonal spikes, and any anomalies. If you're just starting out, consider using on-demand capacity during development - the usage data you gather will be invaluable for optimizing costs later.
Lifecycle Management
Just like cleaning out your closet, regularly reviewing and managing your data can lead to significant savings. Use TTL (Time To Live) for temporary data like session information or logs. For historical data that you need to keep but don't access frequently, consider archiving to S3. Another often-overlooked strategy is using sparse indexes - they not only improve query performance but can significantly reduce storage costs.
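Enabling TTL is a one-time table setting plus an epoch-seconds attribute on each item you want expired; DynamoDB then deletes expired items in the background at no extra cost. A sketch - the table and attribute names are assumptions:

```python
import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("sessions")   # hypothetical table

# One-time: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName="sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Per item: store the expiry as epoch seconds - here, 24 hours from now.
table.put_item(Item={
    "session_id": "abc123",
    "user_id": "C123",
    "expires_at": int(time.time()) + 24 * 60 * 60,
})
```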
Stay Alert: Monitor Your Costs
The best way to avoid cost surprises is to stay vigilant. Set up CloudWatch alarms for throttling events - they're often a sign that your capacity planning needs adjustment. Monitor your consumed capacity against what you've provisioned to identify opportunities for optimization. Track costs at a granular level - by table and index - to understand where your money is going. Most importantly, set up budget alerts to catch any unexpected increases before they become painful surprises.
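As a concrete starting point, here's a CloudWatch alarm on a table's ThrottledRequests metric via boto3. The threshold, period, and names are assumptions to tune for your workload.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert if any requests were throttled in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="orders-throttled-requests",            # hypothetical names
    Namespace="AWS/DynamoDB",
    MetricName="ThrottledRequests",
    Dimensions=[{"Name": "TableName", "Value": "orders"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",                   # no throttles -> no alarm
)
```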
Conclusion
DynamoDB's architecture represents a fundamental shift in how we think about databases. Its design principles - from predictable performance to global admission control - weren't chosen arbitrarily, but evolved from real-world challenges of running databases at scale.
As I've experienced firsthand at Zomato, success with DynamoDB requires embracing these principles rather than fighting them. Whether it's rethinking partition keys for better distribution, chunking large items for efficient storage, or carefully planning capacity needs - each decision needs to align with DynamoDB's architectural foundations.
DynamoDB isn't just another database with a different query language. By understanding its principles and avoiding common pitfalls, you can harness this same power for your applications, regardless of scale. The effort you invest in understanding its architecture and aligning your design decisions with its principles will pay dividends in scalability, reliability, and operational simplicity.