As databases grow in size and complexity, ensuring efficient storage, retrieval, and management of data becomes a significant challenge. Two key strategies to handle large-scale data distribution are partitioning and sharding. While both techniques involve breaking down data into smaller segments, they serve different purposes and are used in different scenarios.
Partitioning
Partitioning is splitting a database table into smaller parts (partitions) within one database.
✨ Think of a database as a club with different rooms for different music genres. Partitioning is how you decide who goes where—pop lovers in one room, rock fans in another. ✨
Vertical & Horizontal Partitioning
Vertical Partitioning - splits a table into multiple tables by columns. Each new table contains a subset of the columns from the original table. Let's look at an example:
Original Table:
customers (customer_id, name, email, address, phone_number)
Partitioned Tables:
customer_details (customer_id, name, email)
customer_contact (customer_id, address, phone_number)
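In SQL, the split could look roughly like the sketch below; the column types and the INSERT statements are assumptions for illustration, not part of the original schema:
-- Hypothetical column types; customer_id is kept in both tables as the link
CREATE TABLE customer_details (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);

CREATE TABLE customer_contact (
    customer_id INT PRIMARY KEY,
    address VARCHAR(255),
    phone_number VARCHAR(20)
);

-- Copy each column subset out of the original customers table
INSERT INTO customer_details SELECT customer_id, name, email FROM customers;
INSERT INTO customer_contact SELECT customer_id, address, phone_number FROM customers;
Because both new tables keep customer_id, the full customer record can still be reassembled with a join when needed.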
Horizontal Partitioning - splits a table into multiple tables by rows. Each new table contains a subset of the rows from the original table. Let's explore an example:
Original Table:
sales (sale_id, product_id, sale_date, amount)
Partitioned Tables:
sales_2023 (sales from 2023)
sales_2024 (sales from 2024)
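Done by hand, the row split could be as simple as the sketch below (the column types and the YEAR() filter are assumed); the next section shows how the database can manage such splits for you:
-- One table per year, filled from the original sales table
CREATE TABLE sales_2023 AS
SELECT * FROM sales WHERE YEAR(sale_date) = 2023;

CREATE TABLE sales_2024 AS
SELECT * FROM sales WHERE YEAR(sale_date) = 2024;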
Types of partitioning
1. By range: Data is split based on a range of values (dates or numbers)
Suppose you have a music database and you want to partition the data based on the release year of the songs:
- Partition: Songs released from 1960 to 1969
- Partition: Songs released from 1970 to 1979
- Partition: Songs released from 1980 to 1989
- Partition: Songs released from 1990 to 1999
CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    release_year INT
)
PARTITION BY RANGE (release_year) (
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN (2000)
);
😎 Range partitioning is ideal when you need to analyze data over specific time periods, for example, studying the evolution of music genres over decades. It is also useful for archiving old data while keeping recent data easily accessible; for instance, you can archive songs from the 1960s and 1970s while keeping newer songs in more frequently accessed partitions.
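As a usage sketch against the table above (assuming MySQL-style partitioning), a decade-level query only has to read the matching partition:
-- Only partition p3 (release years 1980-1989) is scanned
-- In MySQL, EXPLAIN lists the partitions a query will touch
SELECT title, artist
FROM songs
WHERE release_year BETWEEN 1980 AND 1989;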
2. By list: Data is split based on predefined categories (regions or groups)
Suppose you have a music database and you want to partition the data based on the genre of the songs:
- Partition: Rock songs
- Partition: Pop songs
- Partition: Jazz songs
- Partition: Classical songs
CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    genre VARCHAR(50)
)
PARTITION BY LIST (genre) (
    PARTITION p1 VALUES IN ('Rock'),
    PARTITION p2 VALUES IN ('Pop'),
    PARTITION p3 VALUES IN ('Jazz'),
    PARTITION p4 VALUES IN ('Classical')
);
😎 List partitioning is best when you frequently query data based on specific categories, such as music genres, allowing for efficient retrieval of songs within each genre.
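For example (again assuming the MySQL-style table above), a genre filter is served from a single partition:
-- Only partition p3 (Jazz) is read for this query
SELECT title, artist
FROM songs
WHERE genre = 'Jazz';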
3. By hash: Data is evenly split using a hashing function
Suppose you have a music database and you want to partition the data evenly based on the song ID using a hashing function:
- Partition: Songs with song_id % 4 = 0
- Partition: Songs with song_id % 4 = 1
- Partition: Songs with song_id % 4 = 2
- Partition: Songs with song_id % 4 = 3
CREATE TABLE songs (
    song_id INT,
    title VARCHAR(100),
    artist VARCHAR(100),
    genre VARCHAR(50)
)
PARTITION BY HASH (song_id)
PARTITIONS 4;
😎 Hash partitioning is suitable when you need an even distribution of data across partitions, preventing hotspots and balancing the load. It is especially useful for large datasets with uniform access patterns.
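As a usage sketch, a lookup on the hash key is routed to exactly one partition:
-- song_id 42 maps to partition 42 % 4 = 2, so only that partition is read
SELECT title, artist
FROM songs
WHERE song_id = 42;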
Pros & Cons
🟢 Pros:
- Improved Performance - queries that access a small portion of the data can be faster because they only need to scan relevant partitions. Indexing can be more efficient, as smaller indexes are faster to search.
- Easier Management - administrative tasks like backups and archiving can be performed on individual partitions rather than the entire table. Maintenance operations can be done on a per-partition basis, reducing downtime.
🔴 Cons:
- Complexity - designing and implementing an effective partitioning strategy can be complex and requires careful planning. It may require additional administrative overhead to manage partitions.
- Application Changes - existing applications may need to be modified to take full advantage of partitioning. Queries may need to be rewritten to optimize for partitioned tables.
Partitioning is most beneficial for very large tables. Smaller tables may not see significant benefits and could even experience performance degradation.
Sharding
Sharding is splitting a database into smaller, independent databases (shards), where each shard stores a portion of the data.
✨ Sharding is like hosting your party in multiple locations—one club for the 90s hits, another for techno. Each place handles its own crowd. ✨
Types of sharding
1. Ranged/Dynamic Sharding: Data is allocated to shards based on a predefined range of values from a specific field (shard key)
😎 Suitable for datasets where queries often target specific ranges of data, such as date ranges or numerical sequences. For instance, consider this case:
- Shard A: Records with IDs from 0 to 19
- Shard B: Records with IDs from 20 to 39
- Shard C: Records with IDs from 40 to 59
⚠️ Effective shard keys should have high cardinality and well-distributed frequency to avoid unbalanced shards.
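Ranged sharding is often backed by a small directory that maps each range to its shard; here is a minimal sketch with hypothetical names (in practice the lookup usually lives in the application or a routing service, not in SQL):
-- Hypothetical directory table consulted before each query
CREATE TABLE shard_map (
    range_start INT,
    range_end INT,
    shard_name VARCHAR(16)
);

INSERT INTO shard_map VALUES
    (0, 19, 'Shard A'),
    (20, 39, 'Shard B'),
    (40, 59, 'Shard C');

-- Which shard holds record 27?
SELECT shard_name
FROM shard_map
WHERE 27 BETWEEN range_start AND range_end;  -- returns 'Shard B'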
2. Algorithmic/Hashed Sharding: Data is allocated to shards using a hash function applied to a field or set of fields
😎 Ideal for evenly distributing data across shards when a suitable shard key is not available. To give you an idea, here’s an example:
Hash Value = ID % Number of Shards
⚠️ Can lead to increased broadcast operations and complex resharding processes when the number of shards changes.
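A tiny illustration of the formula above, assuming 4 shards:
-- Record 42 is routed to shard 42 % 4 = 2
SELECT 42 % 4 AS target_shard;
-- Changing the number of shards changes this result for almost every record,
-- which is why resharding under this scheme is expensive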
3. Entity/Relationship-Based Sharding: Related data is kept together on the same physical shard.
😎 Effective in relational databases where related data is frequently accessed together, reducing the need for broadcast operations. Suppose we take this scenario:
- Shard A: User data and related payment methods for users A-M
- Shard B: User data and related payment methods for users N-Z
⚠️ Requires careful planning to ensure related data is correctly grouped and managed.
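A minimal schema sketch with hypothetical tables: both tables carry the same shard key, so a user's rows and their payment methods always land on the same shard and can be joined locally (the key here is user_id purely for illustration; the example above groups by the first letter of the user's name):
-- Both tables are sharded on user_id
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE payment_methods (
    payment_id INT PRIMARY KEY,
    user_id INT,  -- same shard key as users
    card_last4 CHAR(4)
);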
4. Geography-Based Sharding: Data is allocated to shards based on geographic information, with shards often located in corresponding geographic regions.
😎 Improves performance and reduces latency by storing data closer to the users accessing it. As an illustration, let’s look at this example:
- Shard A: Data for users in North America
- Shard B: Data for users in Europe
⚠️ Effective for global applications where data locality is important for performance and compliance.
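The routing rule is again applied outside the database, typically in the application or a proxy; expressed in SQL purely for illustration:
-- The user's region decides which regional shard serves the request
SELECT CASE 'Europe'
    WHEN 'North America' THEN 'Shard A'
    WHEN 'Europe' THEN 'Shard B'
END AS target_shard;  -- returns 'Shard B'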
Pros & Cons
🟢 Pros:
- Scalability - sharding allows databases to scale horizontally by distributing data across multiple servers. This makes it easier to handle large volumes of data and high traffic loads.
- Performance - by distributing data and queries across multiple shards, you can improve query performance and reduce latency, as each shard handles a smaller portion of the data.
- High Availability - sharding can enhance availability by isolating failures to individual shards. If one shard goes down, the others can continue to operate, minimizing downtime.
- Fault Isolation - issues in one shard do not necessarily affect others, which can improve the overall reliability of the system.
- Geographic Distribution - sharding allows data to be distributed across different geographic locations, which can reduce latency for users in different regions and comply with data sovereignty regulations.
🔴 Cons:
- Complexity - implementing and managing a sharded database architecture can be complex. It requires careful planning and expertise to ensure data is distributed and accessed efficiently.
- Data Consistency - maintaining data consistency across shards can be challenging, especially in environments where data changes frequently.
- Query Complexity - queries that span multiple shards can be more complex to implement and may require additional logic to aggregate results from different shards.
- Operational Overhead - sharding introduces additional operational overhead, including the need to manage multiple database instances and ensure they are properly synchronized.
- Resource Management - each shard requires its own resources (e.g., CPU, memory), which can lead to increased infrastructure costs and the need for more sophisticated resource management.
- Backup and Recovery - backing up and recovering data from a sharded database can be more complex compared to a single database instance, as each shard needs to be managed individually.
Sharding improves scalability by distributing data across multiple databases. It’s useful for large-scale systems but adds complexity. Smaller databases may not benefit.
Partitioning vs Sharding
In short: partitioning splits data into smaller pieces that still live inside a single database, which mainly improves query performance and manageability for very large tables. Sharding distributes data across multiple independent databases, often on separate servers, which buys horizontal scalability, fault isolation, and geographic distribution at the cost of significantly more operational complexity.
Helpful Links 🤓
Text resources:
- Sharding (AWS team)
- Sharding (MongoDB team)
- Sharding vs Partitioning
- When to Consider Postgres Partitioning
- How To Decide if You Should Use Table Partitioning