Welcome to another Sunday blog, where we explore fascinating concepts in distributed systems and their real-world applications!
Imagine the database behind Instagram. With millions of users, it carries a heavy load, and if it is not scaled properly, performance degrades and users leave. Most software eventually faces a similar need to improve database performance. This post discusses how that can be achieved.
Replication
The database can be replicated to balance the load. A common replication strategy is the leader-follower architecture.
In this setup, the leader handles write requests, while the followers handle read requests. The leader records every write in a write-ahead log, and the followers update themselves by replaying that log. Followers can replicate either synchronously or asynchronously, depending on the design, and each approach has its pros and cons. With fully synchronous replication, a single slow or failed follower can block writes on the leader. With asynchronous replication, followers may lag behind the leader and serve stale reads.
A middle approach is for one follower to replicate synchronously while the others replicate asynchronously (PostgreSQL can be configured this way). Managed services such as AWS RDS and Azure SQL also support read replicas and automated failover.
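The trade-off above can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: the `Leader` and `Follower` classes, the `ship_log` method, and the list standing in for a write-ahead log are all hypothetical names for illustration, not how PostgreSQL or any managed service implements replication.

```python
class Follower:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, log):
        """Replay any log entries this follower has not applied yet."""
        if not self.up:
            return False
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)
        return True


class Leader:
    def __init__(self, sync_followers, async_followers):
        self.log = []  # simplified write-ahead log: ordered (key, value) pairs
        self.data = {}
        self.sync_followers = sync_followers
        self.async_followers = async_followers

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value
        # Synchronous followers must confirm before the client is acknowledged.
        for follower in self.sync_followers:
            if not follower.apply(self.log):
                return False  # models a write that cannot be acknowledged
        return True

    def ship_log(self):
        """Background step: asynchronous followers catch up later."""
        for follower in self.async_followers:
            follower.apply(self.log)


f1, f2 = Follower("f1"), Follower("f2")
leader = Leader(sync_followers=[f1], async_followers=[f2])
acked = leader.write("user:1", "alice")  # waits only for f1
stale = f2.data.get("user:1")            # f2 has not caught up yet
leader.ship_log()
fresh = f2.data.get("user:1")            # f2 now has the write
```

Reading from `f2` before `ship_log` runs shows the stale-read risk of asynchronous replication, while marking `f1` as down makes `write` return `False`, which models the availability cost of synchronous replication.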
One limitation of replication is that while it improves read scalability, it does not enhance write performance. Additionally, the entire database must fit on a single machine. This can be mitigated by offloading certain tables to separate databases on different nodes, but it only postpones the underlying issue.
Partitioning
Traditional relational databases generally do not support partitioning across nodes out of the box, because they were designed on the assumption that the data fits on a single machine. They support replication, but partitioning is usually left to the application.
Partitioning allows a database to scale horizontally for both reads and writes. It can be implemented on the application side, but this comes with complexity. Transactions that span partitions, for example, need a protocol such as two-phase commit (2PC). Additionally, if tables are spread across different machines, queries that touch several of them (joins, for instance) must be assembled at the application level. We must also remember that while storage has become cheaper over time, CPU time remains comparatively expensive.
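As a rough illustration of application-side partitioning, the sketch below routes each key to a node using a stable hash. The node names, the `node_for` helper, and the `ShardedStore` class are assumptions made up for this example, and plain dictionaries stand in for real database nodes.

```python
import hashlib

NODES = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical node names


def node_for(key, nodes=NODES):
    """Pick the node that owns a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class ShardedStore:
    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.shards = {node: {} for node in self.nodes}  # one dict per node

    def put(self, key, value):
        self.shards[node_for(key, self.nodes)][key] = value

    def get(self, key):
        return self.shards[node_for(key, self.nodes)].get(key)

    def scan_all(self, predicate):
        """Cross-partition query: the application fans out and merges results."""
        results = []
        for shard in self.shards.values():
            results.extend(v for v in shard.values() if predicate(v))
        return results


store = ShardedStore()
store.put("user:1", {"name": "Asha", "age": 31})
store.put("user:2", {"name": "Ravi", "age": 24})
adults_over_30 = store.scan_all(lambda user: user["age"] > 30)
```

Single-key reads and writes touch exactly one node, but `scan_all` has to visit every shard, which is the kind of application-level query work described above. Note that modulo-based routing moves most keys when the node count changes, so it is only a starting point.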
NoSQL is More Scalable
Open-source solutions like HBase and Cassandra, as well as managed services like Bigtable and DynamoDB, have revolutionized scalable storage systems. The term NoSQL initially referred to first-generation data stores that did not support SQL. Today, NoSQL stores have evolved significantly, and although most still do not support full SQL, many offer SQL-like query dialects (Cassandra's CQL is one example).
Relational databases are known for consistency and transactions, while NoSQL databases prioritize high availability with relaxed consistency. Generally, NoSQL stores do not support join operations. The data is stored as key-value pairs or documents, and it is typically denormalized. A pure key-value store associates an opaque sequence of bytes (the key) with another opaque sequence of bytes (the value). In contrast, a document store maps a key to a document, which may have a hierarchical structure and does not have to follow a strict schema. The key distinction is that documents in a document store are interpreted and indexed, enabling queries based on their internal structure.
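The difference between the two models can be sketched in a few lines. Both classes below are hypothetical in-memory stand-ins written for this post, not the API of any real product: the key-value store never looks inside its values, while the document store parses each document and maintains an index on one field.

```python
import json


class KeyValueStore:
    """Values are opaque bytes: the store can only look them up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes):
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)


class DocumentStore:
    """Documents are interpreted, so the store can index and query a field."""

    def __init__(self, indexed_field):
        self._docs = {}
        self._indexed_field = indexed_field
        self._index = {}  # field value -> set of document keys

    def put(self, key, document):
        old = self._docs.get(key)
        if old is not None and self._indexed_field in old:
            self._index[old[self._indexed_field]].discard(key)
        self._docs[key] = document
        if self._indexed_field in document:
            value = document[self._indexed_field]
            self._index.setdefault(value, set()).add(key)

    def find(self, value):
        """Query by the indexed field instead of by key."""
        return [self._docs[k] for k in sorted(self._index.get(value, ()))]


kv = KeyValueStore()
kv.put(b"user:1", json.dumps({"city": "Pune"}).encode("utf-8"))

docs = DocumentStore(indexed_field="city")
docs.put("user:1", {"name": "Asha", "city": "Pune"})
docs.put("user:2", {"name": "Ravi", "city": "Delhi"})
in_pune = docs.find("Pune")
```

With the key-value store, finding everyone in a given city would require the application to fetch and decode every value itself. The document store can answer the same question from its index.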
Thus, NoSQL databases are more scalable and available, but transactions may be limited. For example, Cosmos DB by Azure supports transactions only within a single partition. When used correctly, NoSQL can accommodate many of the same use cases as traditional relational databases, with the added benefit of scalability.
We will use DynamoDB by AWS as an example.
Amazon DynamoDB as an Example
In DynamoDB, each entry (item) in a table can have a different set of attributes, but each must have a primary key that uniquely identifies it. The primary key consists of either a partition key alone or a partition key combined with a sort key. The partition key determines which partition the item is stored in, and the sort key orders the items within that partition. This allows for efficient range queries over the sort key within a partition.
DynamoDB maintains three replicas per partition and synchronizes them using state machine replication. Writes are sent to the leader, and the client is acknowledged once at least two of the three replicas successfully record the write. DynamoDB's API supports three kinds of operations: CRUD operations on single items, querying multiple items (provided they share the same partition key, optionally filtered by a condition on the sort key), and scanning the entire table.
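To show why a sort key makes range queries cheap, here is a minimal in-memory sketch. The `Table` class and its method names are assumptions for illustration and are not the DynamoDB API. It keeps the sort keys of each partition in sorted order and uses binary search to find a range.

```python
import bisect
from collections import defaultdict


class Table:
    """Items are grouped by partition key and kept ordered by sort key."""

    def __init__(self):
        self._sort_keys = defaultdict(list)  # partition key -> sorted sort keys
        self._items = {}                     # (partition key, sort key) -> item

    def put_item(self, pk, sk, item):
        if (pk, sk) not in self._items:
            bisect.insort(self._sort_keys[pk], sk)
        self._items[(pk, sk)] = item

    def get_item(self, pk, sk):
        return self._items.get((pk, sk))

    def query_between(self, pk, low, high):
        """Return items in one partition whose sort key is in [low, high]."""
        keys = self._sort_keys.get(pk, [])
        start = bisect.bisect_left(keys, low)
        end = bisect.bisect_right(keys, high)
        return [self._items[(pk, sk)] for sk in keys[start:end]]


orders = Table()
orders.put_item("customer#42", "2024-01-15", {"total": 30})
orders.put_item("customer#42", "2024-03-02", {"total": 75})
orders.put_item("customer#42", "2024-07-20", {"total": 12})
first_quarter = orders.query_between("customer#42", "2024-01-01", "2024-03-31")
```

Because the query names a single partition key, only that partition is consulted, and the sorted sort keys turn the range lookup into a binary search rather than a scan.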
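The two-out-of-three acknowledgement rule can be sketched as follows. This is a deliberately simplified model: `Replica` and `quorum_write` are hypothetical names, the leader is treated as just one of the three replicas, and a real replication protocol would also have to deal with entries left behind on replicas when the quorum is not reached.

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        """Record the write, or report failure if this replica is down."""
        if not self.healthy:
            return False
        self.log.append(entry)
        return True


def quorum_write(replicas, entry, quorum=2):
    """Acknowledge the client only if at least `quorum` replicas record the write."""
    acks = sum(1 for replica in replicas if replica.append(entry))
    return acks >= quorum


replicas = [Replica("leader"), Replica("follower-1"), Replica("follower-2", healthy=False)]
acknowledged = quorum_write(replicas, ("put", "user:1", "alice"))
```

With one replica down the write is still acknowledged, because two of the three replicas recorded it. With two replicas down the same call would return `False`.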
Join operations scale poorly, and DynamoDB does not support them. However, proper data modeling in DynamoDB can eliminate the need for join operations. For example, in a traditional relational database, names and scores might be stored in two separate tables keyed by roll number, and a join would be required to retrieve the name and score for a specific roll number. In DynamoDB, we can store the name and the score as separate items with the same partition key (the roll number) and different sort keys, so a single query on the partition key retrieves both without a join.
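The roll number example can be sketched with plain Python dictionaries. The sample names, scores, and the `"NAME"` and `"SCORE"` sort key values are made up for illustration, and nested dictionaries stand in for partitions. This is not the DynamoDB API.

```python
# Relational-style layout: two tables that must be joined on roll number.
names = {101: "Asha", 102: "Ravi"}
scores = {101: 88, 102: 93}


def join_lookup(roll_number):
    """Combine rows from both tables for one roll number."""
    return {"name": names[roll_number], "score": scores[roll_number]}


# DynamoDB-style layout: one table, partition key = roll number,
# sort key = the kind of record stored ("NAME" or "SCORE").
students = {
    101: {"NAME": "Asha", "SCORE": 88},
    102: {"NAME": "Ravi", "SCORE": 93},
}


def query_partition(table, roll_number):
    """Return every item under one partition key, ordered by sort key."""
    return dict(sorted(table.get(roll_number, {}).items()))


joined = join_lookup(101)
queried = query_partition(students, 101)
```

Both lookups return the same information, but the second one reads from a single partition, so it does not need to coordinate across tables or nodes.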
Additionally, DynamoDB supports secondary indexes to enable more complex access patterns. Local secondary indexes keep the table's partition key but provide an alternate sort key, while global secondary indexes allow for a different partition key and sort key. However, it's important to note that updates to global secondary indexes are asynchronous, so reads from them are eventually consistent.
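The effect of an asynchronously maintained index can be modeled with a queue of pending index updates. The class below is a hypothetical sketch of a global-style secondary index, with `propagate` standing in for the background process that applies updates. It is not how DynamoDB is implemented internally.

```python
from collections import deque


class TableWithGlobalIndex:
    def __init__(self, index_attribute):
        self.items = {}          # primary key -> item
        self.index = {}          # indexed attribute value -> set of primary keys
        self.index_attribute = index_attribute
        self._pending = deque()  # index updates that have not been applied yet

    def put_item(self, key, item):
        old = self.items.get(key)
        self.items[key] = item                  # base table is updated at once
        self._pending.append((key, old, item))  # the index is updated later

    def propagate(self):
        """Apply queued index updates in the order the writes happened."""
        attr = self.index_attribute
        while self._pending:
            key, old, new = self._pending.popleft()
            if old is not None and attr in old:
                self.index.get(old[attr], set()).discard(key)
            if attr in new:
                self.index.setdefault(new[attr], set()).add(key)

    def query_index(self, value):
        return sorted(self.index.get(value, set()))


users = TableWithGlobalIndex(index_attribute="city")
users.put_item("user:1", {"name": "Asha", "city": "Pune"})
before = users.query_index("Pune")  # index has not caught up yet
users.propagate()
after = users.query_index("Pune")   # index now reflects the write
```

A query against the index right after a write can miss the new item, which is what eventual consistency means in practice for this access pattern.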
NewSQL
The latest trend in scalable data stores is the rise of NewSQL, which combines the scalability of NoSQL with the ACID guarantees of relational databases. Unlike NoSQL stores that prioritize availability over consistency during network partitions, NewSQL databases prioritize consistency. The argument is that, with proper design, the slight decrease in availability caused by enforcing strong consistency is negligible for many applications. CockroachDB and Spanner are examples of NewSQL data stores.
Conclusion
In conclusion, while traditional relational databases focus on consistency, NoSQL stores trade some consistency for availability, flexibility, and horizontal scalability, which makes them a good fit for applications that must serve very large workloads.
Here are some links to my previous posts, which I publish every Sunday on distributed systems:
- Building Resilient Applications: Insights into Scalability and Distributed Systems
- Understanding Server Connections in Distributed Systems
- How are your connections with web secure and integral?
- Understanding System Models in Distributed Systems
Feel free to check them out and share your thoughts!