Nikita Kutsokon

Databases: How Data is Stored on Disk

A database system is designed to store, read, and write data efficiently. It organizes data into pages and uses both disk storage and RAM to manage those pages. But before diving into how databases handle data, let's first look at what RAM is, since the rest of the article builds on it.

What is RAM

RAM (Random Access Memory) is the short-term memory of your computer, used to store data and programs that are actively in use. It’s a critical component because it allows the computer to quickly access the information it needs without waiting for slower storage devices like hard drives (HDDs) or solid-state drives (SSDs). RAM is much faster than these storage devices, and when you open a program or file, it gets loaded into RAM to improve performance. The more RAM your computer has, the more data and programs it can store and process at once, which makes multitasking and running demanding applications faster and smoother. However, RAM is volatile, meaning that when you turn off your computer, all the data stored in RAM is erased. This is why it’s essential for running programs in real-time, but it doesn’t hold onto data permanently.

RAM schema

How the Database Stores Data

A database stores data in small blocks called pages, which typically hold several rows of a table or other pieces of data. Each page has a fixed size (commonly 4 KB to 16 KB; for example, PostgreSQL uses 8 KB pages and MySQL's InnoDB uses 16 KB by default), and this consistency helps the system efficiently manage large datasets. When data from a table is stored in a database, the rows of that table are divided and placed across multiple pages. Each page can contain one or more rows, depending on the size of the rows and the page size. For example, if a table has large rows (such as those with many columns or large text), fewer rows may fit on a page; smaller rows allow more to fit on a single page. This method of storing data ensures efficient access, as the database can quickly load entire pages into memory instead of fetching individual rows one by one. This helps speed up queries, especially when large amounts of data need to be processed or retrieved.
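
To make the arithmetic concrete, here is a tiny, illustrative Python sketch of how row size determines how many rows fit on a page. The 8 KB page size and the header overhead are assumed values for the example, not taken from any specific engine.

```python
# Rough illustration (not any specific database's layout): how many rows
# fit on a fixed-size page, given assumed page and row sizes.

PAGE_SIZE = 8 * 1024        # 8 KB page, a common default (e.g. PostgreSQL)
PAGE_HEADER = 24            # assumed per-page header overhead, in bytes

def rows_per_page(row_size: int) -> int:
    """How many rows of `row_size` bytes fit in one page."""
    usable = PAGE_SIZE - PAGE_HEADER
    return usable // row_size

print(rows_per_page(100))    # small rows  -> 81 rows per page
print(rows_per_page(2000))   # large rows  -> 4 rows per page
```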

The data is stored on a disk, like a hard drive (HDD) or a solid-state drive (SSD). These disks preserve the data even when the computer is turned off. However, reading from the disk is slower than reading from the computer's RAM, which is faster but only works while the computer is on. To speed up data access, the database uses a special area in RAM called the buffer pool. This area stores frequently accessed data. Instead of reading small pieces of data one by one, the database reads whole pages at once. These pages are stored on the disk, and when needed, they’re quickly loaded into RAM for faster access.

How the Database Reads Data

When the database needs to access data, it doesn’t read individual rows or records directly from the disk. Instead, it reads entire pages, which are fixed-size blocks containing multiple rows or pieces of data. The database first checks if the required page is already in the buffer pool. If the page is found in RAM, it can be accessed quickly. If not, the database reads the page from the disk and loads it into RAM for future use. This way, the database minimizes the need to access the slower disk.
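
Here is a minimal sketch of that read path, assuming a made-up `BufferPool` class and a fake disk: on a hit the page comes straight from RAM, on a miss it is read from "disk" and cached, with simple LRU eviction when the pool is full.

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: caches recently used pages in RAM (LRU order)."""

    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()          # page_id -> page bytes

    def get_page(self, page_id: int, disk) -> bytes:
        if page_id in self.pages:           # cache hit: no disk access needed
            self.pages.move_to_end(page_id)
            return self.pages[page_id]
        page = disk.read_page(page_id)      # cache miss: slow disk read
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used page
        return page

class FakeDisk:
    def read_page(self, page_id: int) -> bytes:
        return bytes(8 * 1024)              # stand-in for a real disk read

pool = BufferPool(capacity_pages=100)
disk = FakeDisk()
page = pool.get_page(42, disk)   # first access: read from "disk"
page = pool.get_page(42, disk)   # second access: served from the pool
```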

How the Database Writes Data

When the database needs to write data (adding new rows or updating existing data), it first applies the changes to pages in the buffer pool in RAM, not directly to the disk, because writing to RAM is much faster. Later, the database writes the modified pages from RAM to the disk, typically in large batches, to minimize disk access time. This process is called buffering or write-back. To ensure data isn't lost, databases also keep transaction logs: small, sequential records of each change that are flushed to disk when a transaction commits. If the system crashes before the modified pages reach the disk, the transaction logs are replayed on restart to recover the committed changes.
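
Here is a rough sketch of that write-back idea, with invented names: an insert only touches the in-memory page and marks it dirty, and a separate flush step writes dirty pages to disk later.

```python
class Page:
    """Toy in-memory page holding row dictionaries."""

    def __init__(self, page_id: int):
        self.page_id = page_id
        self.rows = []                      # row dicts stored on this page
        self.dirty = False                  # True = RAM copy differs from disk

def insert_row(page: Page, row: dict) -> None:
    page.rows.append(row)                   # fast: only RAM is touched
    page.dirty = True                       # remember the disk copy is stale

def flush_dirty_pages(pages) -> None:
    for page in (p for p in pages if p.dirty):
        write_page_to_disk(page)            # slow: one disk write per page
        page.dirty = False

def write_page_to_disk(page: Page) -> None:
    pass                                    # placeholder for the real disk I/O
```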

Disk I/O Optimization

Disk I/O (Input/Output) optimization plays a key role in enhancing the performance of databases. Since databases store large volumes of data on disks, which are slower than RAM, minimizing the time spent reading from and writing to these disks is essential. Optimizing disk I/O involves improving how a database interacts with its disk storage, resulting in faster and more responsive system performance. Let's explore some techniques used to achieve this:

1. Buffer Pool (Caching)
Think of the buffer pool like a temporary storage area in your computer's memory (RAM). When the database needs data, it first checks if the data is already in the buffer pool. If it is, the database can quickly use it without waiting for it to be read from the slower disk. This speeds up data retrieval, especially for commonly used data.

2. Read-Ahead (Prefetching)
Imagine you’re reading a book and you know the next chapter is important. Instead of waiting for each page, you turn several pages ahead. The database does something similar by predicting which data it might need next and loading it into memory ahead of time. This way, the data is ready when needed, reducing delays.
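
A toy version of read-ahead, reusing the `BufferPool`/`FakeDisk` sketch from earlier; the number of pages to prefetch is an arbitrary assumption here.

```python
READ_AHEAD_PAGES = 4   # assumed prefetch window

def read_with_prefetch(page_id: int, pool, disk) -> bytes:
    page = pool.get_page(page_id, disk)          # the page actually requested
    for next_id in range(page_id + 1, page_id + 1 + READ_AHEAD_PAGES):
        pool.get_page(next_id, disk)             # warm the pool ahead of time
    return page
```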

3. Write-Ahead Logging (WAL)
Before the database writes any new data to the disk, it first records the change in a special log file. This is like making a note before making any changes to ensure that if something goes wrong (like a power failure), the database can recover the changes from the log, preventing data loss.
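
A minimal write-ahead-logging sketch, building on the `Page` sketch above. The log file name and record format are made up, but the ordering is the key point: the log record is forced to disk before the in-memory data changes.

```python
import json
import os

LOG_PATH = "wal.log"   # assumed log file name for this example

def log_change(change: dict) -> None:
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(change) + "\n")  # append-only, sequential write
        log.flush()
        os.fsync(log.fileno())                # make sure it survives a crash

def update_row(page, row_index: int, new_value: dict) -> None:
    log_change({"page": page.page_id, "row": row_index, "new": new_value})
    page.rows[row_index] = new_value          # now it is safe to change RAM
    page.dirty = True                         # the page itself is flushed later
```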

4. Batching Writes
Instead of writing every single change to the disk one by one, the database groups several changes together and writes them all at once. This reduces the number of times the system has to access the disk, making the overall process more efficient.
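
A small illustration of batching, with assumed names and batch size: records are buffered and the expensive sync to disk happens once per batch instead of once per record.

```python
import os

class BatchedLog:
    """Toy log writer that syncs to disk once per batch of records."""

    def __init__(self, path: str, batch_size: int = 32):
        self.file = open(path, "a")
        self.batch_size = batch_size
        self.pending = 0

    def append(self, record: str) -> None:
        self.file.write(record + "\n")
        self.pending += 1
        if self.pending >= self.batch_size:   # one disk sync per batch,
            self.flush()                       # not one per record

    def flush(self) -> None:
        self.file.flush()
        os.fsync(self.file.fileno())
        self.pending = 0
```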

5. Disk Striping (RAID)
Disk striping is like cutting a file into pieces and storing those pieces across several disks. When the database needs the file, it can read from multiple disks at once, which speeds up the process. Striping on its own (RAID 0) only improves speed; RAID levels that add mirroring or parity (such as RAID 1, 5, or 10) combine striping with redundancy so the data also survives a disk failure.
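
To see how striping spreads data, here is a small sketch that maps a logical byte offset to a disk and an offset on that disk, assuming a RAID 0 layout with four disks and a 64 KB stripe unit (typical but arbitrary numbers).

```python
STRIPE_SIZE = 64 * 1024   # bytes per stripe unit (assumed)
NUM_DISKS = 4             # assumed number of disks in the array

def locate(offset: int):
    """Map a byte offset in the logical volume to (disk index, offset on that disk)."""
    stripe_number = offset // STRIPE_SIZE
    disk = stripe_number % NUM_DISKS
    offset_on_disk = (stripe_number // NUM_DISKS) * STRIPE_SIZE + offset % STRIPE_SIZE
    return disk, offset_on_disk

# The first four 64 KB chunks land on disks 0, 1, 2, 3, so a large
# sequential read can be served by all four disks in parallel:
for chunk in range(4):
    print(locate(chunk * STRIPE_SIZE))   # (0, 0), (1, 0), (2, 0), (3, 0)
```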

HDD vs SSD in Database Storage

When storing data for a database, you can use either HDD (Hard Disk Drive) or SSD (Solid State Drive). Both have their strengths and weaknesses, so let’s break it down in simple terms.

HDD
HDDs work by using spinning disks, similar to a record player, with a small arm that moves to read or write data. Because these parts need to physically move, HDDs are slower and take more time to locate and load information. However, they are more affordable per gigabyte, making them a great choice for storing large amounts of data that don’t require fast access. This makes HDDs ideal for backups, archives, and databases with low traffic, where speed is less important than storage capacity.

HDD structure

SSD
SSDs use flash memory, a type of non-volatile storage that retains data even when the power is off. Unlike HDDs, which rely on spinning disks, flash memory stores data electronically in memory cells made of transistors. These cells hold an electrical charge to represent binary data (0s and 1s). Because there are no moving parts, SSDs have much lower latency, meaning they can access and transfer data almost instantly, which results in significantly faster read and write speeds. This makes SSDs ideal for applications that need high-speed performance, like real-time analytics and databases with heavy user traffic. Additionally, SSDs are more durable, consume less power, and generate less heat, making them a reliable and energy-efficient choice for high-performance computing.

SSD Structure
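
As a rough back-of-the-envelope comparison (the latencies below are typical ballpark figures, not measurements of any particular drive), the difference adds up quickly when a query needs many random page reads:

```python
# Assumed, illustrative latencies for a single random page read.
HDD_RANDOM_READ_MS = 10.0    # spinning disk: seek + rotational delay
SSD_RANDOM_READ_MS = 0.1     # flash: no moving parts

pages = 1_000
print(f"HDD: {pages * HDD_RANDOM_READ_MS / 1000:.1f} s")   # ~10.0 s
print(f"SSD: {pages * SSD_RANDOM_READ_MS / 1000:.1f} s")   # ~0.1 s
```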

Helpful Links 🤓

Text resources:

Video resources:
