Introduction
Last year I read "More challenging projects every programmer should try" and decided to take a deep dive into researching solutions for a key-value data store.
There are two major technologies used in this space. Relational databases typically use B-trees to persistently store data. Alternatively, many newer storage solutions use the log-structured merge tree (LSM tree).
LSM trees are specifically designed to handle write-heavy workloads. They are used in many popular NoSQL databases including Apache Cassandra, Elasticsearch, Google Bigtable, Apache HBase, and InfluxDB, as well as in embedded data stores such as LevelDB and RocksDB.
This article provides implementation details for keyva, an LSM-tree-based project that I wrote to better understand how all of this works. The details are generic and cover concepts applicable to other LSM tree implementations.
High-Level Design
Data is always added to an LSM tree using sequential writes. This append-only approach allows for blazingly fast write operations but does require subsequent compaction to reclaim the space of stale records left behind when a key is updated or deleted.
The MemTable, a data structure stored entirely in memory, is initially used to store new data. Operations here are very fast but space is limited and the data cannot be retained if the process is restarted.
In order to recover data across restarts, the same data is also appended to a Write Ahead Log (WAL). The WAL is a simple append-only log that contains a single record for each operation made to the LSM tree.
Eventually the MemTable will become too large to efficiently hold in memory and the data is flushed to a Sorted String Table (SST) file on disk. SST files are indexed and immutable, allowing fast concurrent data access. Eventually when enough SST files are generated a background job will compact them and merge the data into a new "level" of SST files. This gives the tree a chance to remove redundant records and efficiently re-organize data.
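As a rough sketch of this write path in Go (the type and method names below are placeholders for illustration, not keyva's actual API):

// Set records the operation in the WAL first, then updates the MemTable.
// Once the MemTable grows past a configured threshold it is flushed to a
// new immutable SST file at level 0. Locking and error details are omitted.
func (t *lsmTree) Set(key string, value []byte) error {
	// 1. Append to the write-ahead log so the operation survives a restart.
	if err := t.wal.Append(key, value); err != nil {
		return err
	}
	// 2. Insert (or overwrite) the key in the in-memory table.
	t.memtable.Put(key, value)
	// 3. Flush to disk once the in-memory table is full.
	if t.memtable.Size() >= t.flushThreshold {
		return t.flushToSST()
	}
	return nil
}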
SST files can efficiently serve large data sets. For example, Google Bigtable uses log-structured storage and is designed to scale to the petabyte range across a cluster of servers.
Data Model
Records
Keyva uses an LSM tree to store data in terms of key/value pairs. Each key is a UTF-8 encoded string and each value is a sequence of bytes.
Inserts and Updates
A new key/value is inserted into the LSM tree by first being added to the MemTable and WAL, as depicted in the high level design.
You may be surprised to learn that an update is added to the LSM tree in the exact same manner as an insert!
The only difference for an update is that if the key already resides in the MemTable, its value is simply overwritten with the new one. If an existing key has already been flushed to disk as part of an SST file, the old data will remain on disk for some time until the new key/value from the update is eventually merged into that SST level.
So, both insert and update operations are incredibly efficient, but their simplicity makes reads a bit more complicated. Let's look at that next.
Reads
When reading data, an LSM tree must find the most recent value for a given key. Any read operation must therefore start with the MemTable before moving to SST level 0, level 1, and so on.
As soon as the read finds a key it returns the first value found. So in the case of an update, any older values for that key still residing in higher SST levels are simply ignored.
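In Go, that lookup order might look roughly like the following sketch (placeholder names again, not keyva's exact API):

// Get returns the most recent value for key. Newer data always shadows older
// data, so we check the MemTable first and then each SST level in order.
func (t *lsmTree) Get(key string) ([]byte, bool) {
	if entry, ok := t.memtable.Get(key); ok {
		return entry.Value, !entry.Deleted // a tombstone counts as "not found"
	}
	// Level 0 is scanned newest file first because its files may overlap;
	// higher levels can locate the single candidate file with a binary search.
	for _, level := range t.levels {
		if entry, ok := level.Find(key); ok {
			return entry.Value, !entry.Deleted
		}
	}
	return nil, false
}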
Deletes
Data cannot be deleted directly from an SST. Instead the key is flagged as deleted and the data is deleted later when the SST is compacted. These flagged records are called tombstones.
You can see the Deleted flag used in our implementation:
type SstEntry struct {
	Key     string // the record's key
	Value   []byte // raw value bytes; may be empty for a deleted record
	Deleted bool   // tombstone marker set when the key has been deleted
}
Value may be stored as an empty array for a deleted record, so at least we save a bit of space there.
Unfortunately a tombstone cannot be immediately removed when its SST is compacted. In order to guarantee all of the old records for a key are removed the tombstone must reach the highest SST level in the tree. At that point it can be safely removed along with any corresponding value(s) for the key.
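Inside a compaction loop the check can be as simple as this sketch (entry, out, writeEntry, and isLastLevel are illustrative names):

// A tombstone may only be dropped when compacting into the highest level;
// at that point no older value for the key can exist anywhere else in the tree.
if entry.Deleted && isLastLevel {
	continue // skip the record entirely so the key disappears
}
writeEntry(out, entry)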
Write Amplification
Write amplification is the ratio of the amount of data actually written to disk versus the amount of data written to the LSM tree. The same data may be written to disk multiple times as a key/value is promoted from one SST level to another. An important performance consideration is therefore to minimize the number of repeated disk writes.
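For example, if an application writes 1 GB of key/value data but the flushes and compactions of that data end up writing 5 GB to disk, the write amplification is 5.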
Data Structures
MemTable
All data added to the LSM tree is initially stored in the MemTable, essentially an in-memory cache.
Data in the MemTable needs to be arranged for fast access and ideally for low-cost concurrent read/write operations. A self-balancing tree such as a red-black tree can work well for this purpose; our implementation uses a skip list.
If a key already exists in the table when a request is received, the value will be updated in place. This is different from the other data structures employed by the LSM tree, which are immutable.
Deletes must be retained in the table as well. It is important to store a tombstone in case the key still has data stored in an SST; the deletion will be resolved later when we compact SST files.
Finally, when the MemTable reaches a certain threshold it must be flushed to disk. A potential optimization here is to allocate a new MemTable and designate the current MemTable as read-only. The old table can then be set off to the side for a background job to write it to disk.
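That hand-off might look something like the sketch below (the field and function names are made up for illustration, and the locking a real implementation needs is omitted):

// maybeFlush freezes the full MemTable and swaps in an empty one. The frozen
// table is never written to again, so a background goroutine can safely
// stream it out to a new level-0 SST file while new writes continue.
func (t *lsmTree) maybeFlush() {
	if t.memtable.Size() < t.flushThreshold {
		return
	}
	frozen := t.memtable
	t.memtable = NewMemTable() // new writes (and a fresh WAL) start here
	go func() {
		t.writeSST(frozen)  // persist the read-only table to disk
		t.deleteWAL(frozen) // its WAL is no longer needed for recovery
	}()
}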
Write Ahead Log
The WAL is a plain-text file containing a record of every operation applied to the table; essentially it is a transaction log of all operations on the MemTable.
This allows the in-memory portion of the tree, the data that has not yet been flushed to an SST, to be reconstructed in the event of a service restart.
In our implementation a separate WAL file is used for each MemTable. After a MemTable is written to disk its WAL file is purged, as the data is now retained in persistent storage by an SST. This prevents infinite growth of the WAL.
Finally, data is not organized efficiently within the WAL, so we only want to use it as a means of data recovery for the MemTable.
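Recovery on startup could replay the log into a fresh MemTable, roughly like the sketch below, which uses bufio, os, and strings from the standard library. The one-operation-per-line, tab-separated record format here is an assumption for illustration, not keyva's actual format.

// replayWAL rebuilds a MemTable from the write-ahead log after a restart.
func replayWAL(path string, mem *MemTable) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 3)
		switch {
		case len(fields) == 3 && fields[0] == "set":
			mem.Put(fields[1], []byte(fields[2])) // re-apply the insert/update
		case len(fields) >= 2 && fields[0] == "del":
			mem.Delete(fields[1]) // re-apply the delete as a tombstone
		}
	}
	return scanner.Err()
}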
Sorted String Table
SST files are the primary data representation for storing an LSM tree on disk. Each one contains a series of key/value pairs sorted by key.
Each file is immutable, making it easier to access data concurrently.
Sparse Index
A sparse index may be used to find data contained in an SST file. The index does not need to contain all keys since data is sorted. Instead it may include every Nth key.
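A sketch of how such an index could be searched, assuming each index entry also records the byte offset at which that key starts in the file:

// indexEntry records every Nth key along with its byte offset in the SST file.
type indexEntry struct {
	Key    string
	Offset int64
}

// seekOffset returns the offset at which a sequential scan for key should begin:
// the offset of the greatest indexed key that is less than or equal to key.
// Because the SST is sorted, the key (if present) must lie between that entry
// and the next indexed one.
func seekOffset(index []indexEntry, key string) int64 {
	// sort.Search finds the first indexed key strictly greater than key.
	i := sort.Search(len(index), func(i int) bool { return index[i].Key > key })
	if i == 0 {
		return 0 // key sorts before the first indexed key; start at the beginning
	}
	return index[i-1].Offset
}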
Bloom Filter
A Bloom filter is used to determine whether an SST might contain a key before we check the SST itself. If the Bloom filter reports that a key is absent then we know the key cannot be contained in the corresponding SST.
This helps speed up read operations by reducing the number of disk accesses when reading data.
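Below is a minimal Bloom filter sketch in Go to show the idea. It is not keyva's actual filter; real implementations use a packed bit set and carefully tuned sizes, but the guarantee is the same: false positives are possible, false negatives are not.

// bloomFilter derives k bit positions per key by double hashing a single
// 64-bit FNV-1a hash (from the standard library's hash/fnv package).
type bloomFilter struct {
	bits []bool
	k    int
}

func newBloomFilter(m, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, m), k: k}
}

func (b *bloomFilter) positions(key string) []int {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	pos := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		pos[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return pos
}

// Add marks the key as (possibly) present.
func (b *bloomFilter) Add(key string) {
	for _, p := range b.positions(key) {
		b.bits[p] = true
	}
}

// MightContain returns false only if the key was definitely never added,
// which is what lets the read path skip an SST file entirely.
func (b *bloomFilter) MightContain(key string) bool {
	for _, p := range b.positions(key) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}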
Data Layout
Level
SST files are organized into a series of levels starting at level 0. Each level contains more data than the last, and the maximum number of levels is configurable. Each level also contains multiple SST files. For example, consider the following SST files in level 0:
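(The original post shows these files as an image; the hypothetical listing below stands in for it and illustrates the same situation.)

sst-0-0 (older): "Chicago", "Denver", "Tucson"
sst-0-1 (newer): "Atlanta", "Phoenix", "Seattle"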
As shown above, files at level 0 may contain overlapping data. For example, observe how the first file contains a key for "Tucson" while the first key in the second file is "Atlanta".
This overlap is unavoidable because files are added on demand as the MemTable reaches capacity. The problem is that in order to find the most recent value for a key in level 0, each SST file must be checked, starting from the most recent file and working back to the oldest.
Higher levels are arranged more efficiently. Consider the same data after being merged to level 1:
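(Again standing in for the original figure with hypothetical data, the same keys might now be laid out as follows.)

sst-1-0: "Atlanta", "Chicago", "Denver"
sst-1-1: "Phoenix", "Seattle", "Tucson"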
As you can see data is guaranteed to be in sorted order across all files in this level. A single binary search may be used to find the SST file containing a given key.
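That search might look like the following sketch, assuming the first and last key of each file in the level are kept in memory:

// fileMeta holds the key range covered by one SST file in a sorted level.
type fileMeta struct {
	FirstKey string
	LastKey  string
	Path     string
}

// findFile returns the single file in a sorted (level >= 1) level that could
// contain key, or false if the key falls outside every file's range.
func findFile(files []fileMeta, key string) (fileMeta, bool) {
	// The files are sorted and non-overlapping, so binary search for the first
	// file whose last key is >= key, then confirm the key is within its range.
	i := sort.Search(len(files), func(i int) bool { return files[i].LastKey >= key })
	if i < len(files) && files[i].FirstKey <= key {
		return files[i], true
	}
	return fileMeta{}, false
}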
As a result we probably want to minimize the amount of data in level 0, especially for a large dataset.
Each subsequent SST level will contain more data than the previous one. However, if each level holds an order of magnitude more data than the one before it, only a handful of levels are needed even for a large dataset.
Segment
Data is divided into segments on disk, one per SST file.
Each segment also contains a header (sequence number) and an index.
Block
Data within an SST is divided into blocks. There is one sparse index per block.
SST files may be binary files that are optionally compressed to save space.
Merge
A k-way merge algorithm is used to combine data from many SST files into a new series of SST files, or alternatively to compact data within a single SST level.
In keyva, sst.Compact implements a streaming merge algorithm intended to scale up to large datasets. For each source SST file, a single record containing that file's next entry and its file handle is added to a min heap. The smallest record is then taken from the top of the heap and written to the current output SST file. A new record from that record's source SST is added back to the heap, and the process continues until all SST data is merged.
By using a min heap we can efficiently write new SST data in sorted order even if we are merging data from a large number of source SST files.
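Below is a stripped-down sketch of that idea using Go's container/heap over in-memory slices of SstEntry, with the sources assumed to be ordered newest first. keyva's sst.Compact streams from file handles instead, and the real code also deals with tombstones and with splitting output across files, so treat this only as an illustration of the heap-based merge.

// mergeSource is one sorted input stream; here it is just a slice with a
// cursor, where a real implementation would read from an open SST file.
type mergeSource struct {
	entries []SstEntry
	pos     int
}

// heapItem pairs an entry with the index of the source it came from. Ties on
// equal keys are broken by source index, with a lower index meaning newer data.
type heapItem struct {
	entry SstEntry
	src   int
}

type entryHeap []heapItem

func (h entryHeap) Len() int      { return len(h) }
func (h entryHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h entryHeap) Less(i, j int) bool {
	if h[i].entry.Key != h[j].entry.Key {
		return h[i].entry.Key < h[j].entry.Key
	}
	return h[i].src < h[j].src
}
func (h *entryHeap) Push(x any) { *h = append(*h, x.(heapItem)) }
func (h *entryHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// merge combines any number of sorted sources into one sorted, de-duplicated
// stream, keeping only the newest entry seen for each key.
func merge(sources []*mergeSource) []SstEntry {
	h := &entryHeap{}
	for i, s := range sources {
		if len(s.entries) > 0 {
			heap.Push(h, heapItem{entry: s.entries[0], src: i})
			s.pos = 1
		}
	}

	var out []SstEntry
	for h.Len() > 0 {
		item := heap.Pop(h).(heapItem)
		// Older duplicates of a key we have already emitted are skipped.
		if len(out) == 0 || out[len(out)-1].Key != item.entry.Key {
			out = append(out, item.entry)
		}
		// Refill the heap from the source this entry came from.
		s := sources[item.src]
		if s.pos < len(s.entries) {
			heap.Push(h, heapItem{entry: s.entries[s.pos], src: item.src})
			s.pos++
		}
	}
	return out
}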
Data can be merged at regular intervals or at certain thresholds. For example a time series database might merge data at certain time intervals (daily, hourly, etc).
Conclusion
And there you have it. Thanks for reading!
The full source code for my sample implementation is available on pkg.go.dev.
Top comments (4)
Why did you use a skip list for the memtable instead of the built-in map in Go?
Go's built-in map is not thread-safe, so mutexes must be used for concurrent access. For high performance it is preferable to use a data structure better suited to concurrent access, such as a skip list.
Could you get inspiration from what Java's ConcurrentHashMap does? Essentially, it uses a technique called lock striping—instead of holding a single lock on the entire hash map, different locks are held depending on the bucket. As far as I can recall, Java uses 16 different locks spread across the entire hash map. This helps reduce contention a bit. Furthermore, it has the concept of read-write locks, so multiple readers can acquire the lock simultaneously, as long as no other thread is writing.
However, thinking about this again, the SkipList with an RWMutex probably works similarly. The striping behavior would come from the SkipList's inherent hierarchical structure, while the RWMutex could allow multiple readers to hold the lock simultaneously.
In your Get function, when searching on disk you check the index files across all of the levels, including the higher ones, one file at a time. Since files within the same level (for levels greater than 0) are in sorted order, why do you need to check them one by one?