Ujjwal Raj

Understanding Data Partitioning for Scalable Distributed Systems

Software companies run on data. Today there is a great deal of it, and a lot of effort goes into processing it to derive insights that inform business decisions and practices. But how are the storage and retrieval of this data handled efficiently in distributed systems? Today, we will look at partitioning, one way of making data storage and retrieval more efficient.

Welcome to another Sunday blog, where we explore fascinating concepts in distributed systems and their real-world applications!

What is partitioning?

Imagine a social media company like Instagram. Instagram has a lot of users, and the number is increasing day by day. As the data grows, it becomes impossible to store everything in a single system, database, or disk. Instead, the data is split into pieces, called shards or partitions, that live on different machines. This is called sharding or partitioning. Partitioning also spreads the access request load across multiple databases instead of a single one. Millions of people are scrolling Instagram at any moment, and a single database could not serve that many fetch requests from users' phones. Partitioning lets each machine handle only a share of them.

In a system like Instagram, users' clients send access requests to a Gateway server, which routes each request to the partition where the data resides. We will now look at how this routing is managed.

We should also keep in mind that the partitioning logic should distribute data uniformly among the partitions, and that access request traffic should be uniform across them as well. Only then can we say the system scales well.

The Gateway does the data mapping. Think of it as a key-value lookup: the key identifies the data, and the value is the number of the partition that holds it.
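To make the mapping idea concrete, here is a minimal sketch of such a gateway. Everything in it is an assumption for illustration: the `Gateway` class, the in-memory dicts standing in for partitions, and the `by_first_char` routing rule are hypothetical, not how any real system does it.

```python
from typing import Callable, Dict, Optional


class Gateway:
    """Toy gateway: maps each key to a partition and forwards the request."""

    def __init__(self, num_partitions: int, partition_for: Callable[[str], int]):
        # Each partition is modeled as an in-memory dict; in a real system it
        # would be a separate database node.
        self.partitions: Dict[int, Dict[str, str]] = {
            i: {} for i in range(num_partitions)
        }
        self.partition_for = partition_for

    def put(self, key: str, value: str) -> int:
        pid = self.partition_for(key)  # key -> partition number
        self.partitions[pid][key] = value
        return pid

    def get(self, key: str) -> Optional[str]:
        # Reads use the same mapping, so they land on the same partition.
        return self.partitions[self.partition_for(key)].get(key)


# Hypothetical mapping rule: route by the first character of the user ID.
def by_first_char(key: str) -> int:
    return ord(key[0]) % 3


gateway = Gateway(num_partitions=3, partition_for=by_first_char)
stored_in = gateway.put("alice", "profile data for alice")
```

The mapping rule is passed in as a function, so the range and hash schemes discussed next could be plugged in without changing the gateway itself.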

Range Partitioning

Range partitioning is a method of dividing data into segments based on a key column's value range.

The major challenge in range partitioning is uneven data distribution: some partitions end up with a huge amount of data, while others hold very little. For example, range partitioning Instagram accounts by follower count can leave the 0-500 followers partition overloaded while the 5001+ followers partition remains sparse, since most accounts typically have few followers. This imbalance can cause performance bottlenecks.
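The follower-count example can be sketched in a few lines. The boundaries (0-500, 501-5000, 5001+) and the sample follower counts below are made up for illustration; only the idea of looking up a range comes from the text.

```python
import bisect
from collections import Counter

# Hypothetical inclusive upper bounds of the first two ranges:
# partition 0 holds 0-500, partition 1 holds 501-5000, partition 2 holds 5001+.
UPPER_BOUNDS = [500, 5000]


def range_partition(follower_count: int) -> int:
    """Return the index of the range that contains follower_count."""
    return bisect.bisect_left(UPPER_BOUNDS, follower_count)


# Made-up sample in which most accounts have few followers.
sample_follower_counts = [3, 12, 48, 120, 250, 310, 499, 500, 1200, 80000]

# Number of accounts that land in each partition.
sizes = Counter(range_partition(c) for c in sample_follower_counts)
```

With this sample, eight of the ten accounts fall into partition 0, which is the skew described above.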

Another major challenge is hotspots. A hotspot is a partition that receives far more access requests than the others. For instance, if the data is partitioned by date, a single node will handle all requests for the current day.
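A small simulation shows how a date-based scheme concentrates load. The fixed date and the request mix (90 requests for today, one for each of the ten previous days) are assumptions chosen only to illustrate the effect.

```python
from collections import Counter
from datetime import date, timedelta


def date_partition(day: date) -> str:
    """One partition per calendar day, named by its ISO date."""
    return day.isoformat()


today = date(2024, 1, 7)  # fixed, hypothetical "current day"

# Hypothetical workload: most requests ask for today's data.
requested_days = [today] * 90 + [today - timedelta(days=i) for i in range(1, 11)]

# Requests received by each partition.
load = Counter(date_partition(d) for d in requested_days)
hottest_partition, hits = load.most_common(1)[0]
```

Here the partition for the current day receives 90 of the 100 requests, while every other partition receives one.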

In range partitioning, rebalancing is also a problem. Whenever the load increases, new partitions need to be added, usually by splitting an overloaded range. If the load decreases, some partitions need to be merged. This is called dynamic partitioning. These rebalancing actions can require a huge amount of data transfer from one node to another.
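As a rough sketch of what adding and merging partitions involves, the snippet below splits an overloaded partition at its median key and merges two adjacent ones. The function names and the median split point are assumptions for illustration; real systems choose split points and move data in their own ways.

```python
from typing import List, Tuple


def split_partition(keys: List[int]) -> Tuple[List[int], List[int]]:
    """Split a partition at its median key into a lower and an upper half."""
    ordered = sorted(keys)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def merge_partitions(left: List[int], right: List[int]) -> List[int]:
    """Merge two adjacent partitions back into one sorted partition."""
    return sorted(left + right)


overloaded = [5, 12, 20, 31, 47, 58]
kept, moved = split_partition(overloaded)  # `moved` must be shipped to a new node
restored = merge_partitions(kept, moved)
```

In this sketch, half of the overloaded partition's keys have to be transferred to the new node, which is the data transfer cost mentioned above.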

Hash Partitioning

Hash partitioning is a method of distributing data across multiple partitions by applying a hash function to a key (such as a user ID or order ID). The hash function computes a hash value, which is then used to determine the partition where the data will be stored. With a good hash function, keys tend to be spread evenly across partitions.
For example, the formula partition_value = hash_of_key mod N can be used to distribute keys among N partitions.
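The mod N formula can be sketched as follows. The choice of MD5 (truncated to 64 bits) as the hash function and the `user-<i>` key format are assumptions; a stable hash is used because Python's built-in `hash()` for strings varies between runs.

```python
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Hash a key to a 64-bit integer that is the same on every run."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_partition(key: str, n: int) -> int:
    """partition_value = hash_of_key mod N"""
    return stable_hash(key) % n


# Hypothetical user IDs spread over N = 4 partitions.
keys = [f"user-{i}" for i in range(10_000)]
sizes = Counter(hash_partition(k, 4) for k in keys)
```

Each partition should end up with roughly a quarter of the keys, in contrast to the skew seen with range partitioning on follower count.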

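Hotspots can still occur with hash partitioning, for example when a single key is extremely popular. Rebalancing is also costly with the plain mod N formula, because changing N changes the partition of most keys. Consistent hashing (an improved method of hash partitioning) involves much less data transfer when repartitioning.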
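To see why repartitioning is expensive with the plain mod N formula, the sketch below counts how many keys change partition when N grows from 4 to 5. The MD5-based hash and the key names are assumptions for illustration.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a key to a 64-bit integer that is the same on every run."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


keys = [f"user-{i}" for i in range(10_000)]

# A key has to move if its partition under N = 4 differs from that under N = 5.
moved = sum(1 for k in keys if stable_hash(k) % 4 != stable_hash(k) % 5)
moved_fraction = moved / len(keys)
```

For a uniform hash, a key stays put only when its hash gives the same remainder mod 4 and mod 5, which happens for about one key in five. So roughly 80% of the keys would have to be relocated just to add one partition.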

Consistent Hashing

The main idea behind consistent hashing is that both nodes and data keys are hashed to positions on a circle (the hash ring), and each key is stored on the closest node in the clockwise direction. When a node joins or leaves, only the keys next to it on the ring are affected, which avoids the need to rehash all keys. This makes it well suited to distributed systems such as load balancers, caches, and distributed databases.

Image description

As the figure shows, if we add a partition P4, only the data between P2 and P4 needs to be relocated.
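The ring described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `HashRing` class and its method names are hypothetical, MD5 is an arbitrary choice of hash, and each node gets a single position on the ring.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def ring_position(name: str) -> int:
    """Hash a node name or data key to a position on the ring."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str] = ()):
        self._positions: List[int] = []   # sorted node positions on the ring
        self._owners: Dict[int, str] = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, clockwise."""
        if not self._positions:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        # Wrap around to the first node when the key lies past the last one.
        return self._owners[self._positions[idx % len(self._positions)]]


ring = HashRing(["P1", "P2", "P3"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("P4")
after = {k: ring.node_for(k) for k in keys}

# Keys whose owner changed; every one of them now belongs to P4.
relocated = [k for k in keys if before[k] != after[k]]
```

Every key in `relocated` moves to P4 and nothing else changes owner, which matches the behavior described for the figure. Compare this with the mod N formula, where adding a partition reshuffles most keys.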

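Hash partitioning has one drawback compared with range partitioning: the sorted order of the data is lost. Keys that are close together end up scattered across partitions, so a range scan can no longer be served efficiently by reading one contiguous partition.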
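The cost of losing sorted order can be illustrated by counting how many partitions a range scan has to touch under each scheme. The order ID boundaries, the scanned range (100 to 200), and the MD5-based hash are all assumptions made for this sketch.

```python
import bisect
import hashlib

N = 4
# Hypothetical inclusive upper bounds: order IDs 0-1000, 1001-2000, 2001-3000, 3001+.
UPPER_BOUNDS = [1000, 2000, 3000]


def range_partition(order_id: int) -> int:
    return bisect.bisect_left(UPPER_BOUNDS, order_id)


def hash_partition(order_id: int) -> int:
    digest = hashlib.md5(str(order_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N


scan = range(100, 201)  # scan order IDs 100..200 inclusive

range_touched = {range_partition(k) for k in scan}
hash_touched = {hash_partition(k) for k in scan}
```

With range partitioning the whole scan is served by partition 0, while with hash partitioning the same order IDs are spread over several partitions, so the scan has to query all of them and combine the results.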

Conclusion

In conclusion, partitioning is crucial for efficiently managing large-scale data in distributed systems, ensuring balanced storage, retrieval, and access.

Here are some links to my previous posts, which I publish every Sunday on distributed systems:

Feel free to check them out and share your thoughts!
