The storage of persistent data requires data structures that use the capabilities of a block store to organize and index information for efficient and complete retrieval. Traditionally, file systems were the data structures that organized block stores. The traditional alternative to a file system was the database, which constituted an entirely different field of computing than did operating systems.
While file systems have developed in efficiency and sophistication, the variety of compromises that must be negotiated has fragmented the notion of a persistent store more and more drastically in modern computing. For instance, there is demand for evanescent file systems, where files are temporarily and efficiently available under a URI identifier but disappear after a certain time limit.
We will study three data stores that underlie some modern systems for data persistence. These three data stores do use files and file systems, but accommodate their limitations and provide the desired properties by building on top of files. In particular, it is recognized that,
- file systems are slower than memory, so the fewest possible trips to the block store should be made,
- file systems can support reads at arbitrary locations, but only support well a write that appends to the end of an existing file or creates a new file,
- it is simpler to handle files if they have a special immutability: either two copies of the file are identical, or they differ in size and would agree exactly if the longer file were truncated to the size of the shorter.
Having special immutability gets rid of problems of file consistency, as there is never an alternative version of a file. This simplifies replication and backup: backup consists of keeping copies of whatever files exist, and replication consists of having all the files on multiple machines. If files have versions, or if files are deleted only to reappear later with different content under the same name, things are not nearly so simple.
I have used the term special immutability to encompass both the true immutability of a specially immutable file to which no more appending writes are promised, and the simpler consistency guarantee that there never be two inconsistent appends to a specially immutable file while the file is open for appending writes. We will term these closed and open specially immutable files, respectively.
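The consistency condition above can be checked directly: two copies of a specially immutable file agree exactly when one is a byte-for-byte prefix of the other. A minimal sketch of such a check:

```python
def consistent(path_a, path_b, chunk=1 << 16):
    """Return True if one file is a byte-for-byte prefix of the other,
    i.e. the two files are consistent copies of a specially immutable file."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            ba, bb = a.read(chunk), b.read(chunk)
            n = min(len(ba), len(bb))
            if ba[:n] != bb[:n]:
                return False        # the common prefix disagrees
            if n < max(len(ba), len(bb)):
                return True         # one file ended; the shorter is a prefix
            if not ba and not bb:
                return True         # identical length and content
```

A replica can run this against its copy of a closed file; any disagreement indicates corruption rather than a legitimate alternative version.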
Bitcask is a simple data store for key-value pairs that makes use of in-memory hashing and specially immutable files. See Bitcask: A Log-Structured Hash Table for Fast Key/Value Data by Justin Sheehy and David Smith of Basho Technologies, 2010.
A Bitcask instance has any number of specially immutable files, all of which are called inactive except for one active file. The inactive files are closed; the active file is open. Each file is a sequence of variable-length records, each containing a checksum, a timestamp, the key string with its length, and the value string with its length.
An in-memory hash table maps each key to a file identifier and a position in that file, where the key-value pair for that key resides, if there is such a pair. Updates to a key-value pair create a new record by appending to the active file, and the hash table points only to the newest entry. Hence, updates are made by obsoleting, but not necessarily deleting, old data. Since records cannot be removed from the append-only files, a deletion is modeled as an update to a special tombstone value. If a key is matched with a tombstone, Bitcask returns the same as if the key does not exist.
A new key is handled by appending the key-value pair to the active file, then making an entry in the hash table. Doing things in this order makes the system crash proof: incomplete updates can be reapplied, and are inconsequential until completed.
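The write and read paths can be sketched in a few lines. This is a minimal single-file sketch, not the Basho implementation: the exact record layout and the tombstone value are my assumptions, since the paper leaves them open.

```python
import os, struct, time, zlib

TOMBSTONE = b"__tombstone__"   # hypothetical sentinel; the paper does not fix a value

class Bitcask:
    """Minimal sketch: one active file plus an in-memory hash table."""

    def __init__(self, path):
        self.f = open(path, "ab+")     # active file, opened for appending
        self.index = {}                # key -> offset of its newest record

    def put(self, key, value):
        # Assumed record layout: crc, timestamp, key length, value length, key, value.
        payload = struct.pack(">III", int(time.time()), len(key), len(value)) + key + value
        record = struct.pack(">I", zlib.crc32(payload)) + payload
        offset = self.f.seek(0, os.SEEK_END)
        self.f.write(record)           # step 1: append to the active file
        self.f.flush()
        self.index[key] = offset       # step 2: only then update the hash table

    def get(self, key):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])
        crc, = struct.unpack(">I", self.f.read(4))
        head = self.f.read(12)
        _ts, klen, vlen = struct.unpack(">III", head)
        data = self.f.read(klen + vlen)
        assert zlib.crc32(head + data) == crc   # detect torn or corrupt records
        value = data[klen:]
        return None if value == TOMBSTONE else value

    def delete(self, key):
        self.put(key, TOMBSTONE)       # deletion is an update to the tombstone
```

Note that `put` performs the append before touching the index, matching the crash-proof ordering described above: a record written but not yet indexed is simply invisible until reapplied.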
When the active file reaches a certain maximum permitted size, it is made inactive and immutable, and a new active file is started.
Since there can be collisions in the hash table, depending on how the hash table is implemented, multiple file reads might be needed to retrieve a key's value; however, it will be expected constant time for a properly administered hash table.
The storage requirement under the scheme as described so far is proportional to the number of key-value updates, including the initial value of each key. (Deletions also add to this count, but they can affect the size by at most a factor of two, so the statement of proportionality is unaffected.)
To save space, obsolete values and tombstones are removed by scanning the collection of inactive files, creating a series of new inactive files, and re-hashing based on the series of new inactive files and the single active file. Note that the old files are retired; they are made irrelevant rather than destroyed. Backups and replicas will not be corrupted if they continue to host these old files. They are simply no longer referred to by the current hash table.
The details of rebuilding the inactive files are not given in the paper. It could be that the files are walked and, for each key-value pair, the key is looked up to see whether this pair is the most current; if not, the pair is skipped, else it is copied into the new file.
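That guessed-at merge can be sketched as follows. The record encoding and the in-memory representations here are my own simplifications: each inactive file is a byte string of length-prefixed records, and the index maps each live key to the (file id, offset) of its newest record.

```python
import struct

def encode(key, value):
    # Hypothetical length-prefixed record: key length, value length, key, value.
    return struct.pack(">II", len(key), len(value)) + key + value

def records(blob):
    """Yield (offset, key, value) for each record in one inactive file."""
    off = 0
    while off < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, off)
        key = blob[off + 8 : off + 8 + klen]
        value = blob[off + 8 + klen : off + 8 + klen + vlen]
        yield off, key, value
        off += 8 + klen + vlen

def merge(inactive, index):
    """inactive: {file_id: bytes}; index: {key: (file_id, offset)} pointing at
    the newest record for each live key. Walk every record; copy it forward
    only if the hash table still points at it. Returns the new file's bytes
    and the rebuilt index."""
    out, new_index = bytearray(), {}
    for file_id, blob in inactive.items():
        for off, key, value in records(blob):
            if index.get(key) == (file_id, off):      # is this the current version?
                new_index[key] = ("merged", len(out))
                out += encode(key, value)
    return bytes(out), new_index
```

Obsolete records, and records for deleted keys whose index entry is gone, simply fail the currency test and are never copied forward.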
LevelDB was created by Google in 2011. Unlike Bitcask, LevelDB is a key-value store that supports range queries: the keys are linearly ordered and it is possible to return all key-value pairs for all keys falling inside the query range. It builds on top of a specially immutable file called an SST, a sorted string table. An SST is a file consisting of a sequence of key-value pairs, each key and value an arbitrary string, where the keys appear in increasing order. Each SST has an index giving each key and the offset in the file where that key appears. This index can be brought into memory to give log(N)-time access to a key in an SST with one file read.
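A point lookup against one SST is then a binary search over the in-memory index followed by a single file read. A sketch, where `read_at` stands in for the one read from the block store (a hypothetical callback, not a LevelDB API):

```python
import bisect

def sst_get(index_keys, index_offsets, read_at, key):
    """index_keys: sorted keys of the SST; index_offsets: the file offset of
    each key's record; read_at(offset): hypothetical function returning the
    key-value pair stored at that offset. One binary search, one file read."""
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None                      # key not in this SST
    return read_at(index_offsets[i])     # the single read from the block store
```

Extending `bisect` over the two boundary keys of a query range yields the contiguous run of offsets needed for a range query against the same index.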
LevelDB has three sorts of files:
- an appendable log file for new or updated key-value pairs,
- a collection of level 0 SST’s, containing key-value pairs that were in a recently closed log file, now sorted,
- a collection of level i SST’s for i greater than 0, which contain key-value pairs, sorted and in disjoint ranges by file.
Log and Level 0 SST’s
Writes append to the log file. When the log file reaches a certain size, its contents are sorted and written to an immutable SST, called a level 0 SST. The collection of level 0 SST’s might contain keys in overlapping ranges: each level 0 SST contains any and all key-value pairs written in a time interval.
Level i compaction of SST’s
Level 0 SST’s are periodically collected together with Level 1 SST’s, and the entirety of key-value pairs found in this collection is sorted to create a new series of Level 1 SST’s. These Level 1 SST’s contain disjoint ranges of keys, so that a search for a key first binary searches for the single SST file in which the key might occur, and then binary searches the index of the located SST for the possible offset in that file at which the key occurs.
Further compaction occurs by periodically collecting the Level i and Level i+1 SST’s that overlap a given key range, and re-sorting all key-value pairs in these files to create a new set of Level i+1 SST files.
For the purpose of removing obsolete key-value pairs, the level 0 SST’s must be kept in chronological order. Sweeping through the key-value pairs, any key already encountered in the sweep blocks any later key-value pair for that key from being copied forward; it is obsolete. Otherwise, there is at most one key-value pair per key on any level, and level i has a more recent pair than level i+1.
Tombstones must be passed to the next level during compaction, since a level i tombstone needs to block a key-value pair on a level j that might be higher than i+1. A tombstone is eventually migrated out when compacting into the highest level SST.
Reading and writing a key value
Writing is done by appending the key-value pair to the log file. Updates make old key-value pairs obsolete, and these will then be masked by the search order during reads. Deletes are done by writing tombstones. The level compaction will drive out obsolete and deleted key-value pairs.
Reading is accomplished by:
- scan the log file for the key;
- if the key is not found, search every level 0 SST for the key, using binary search on the associated index;
- if the key is still not found, beginning with level 1, and continuing to level i+1 if not found at level i, search among the level i SST’s to identify the single SST covering the range of the queried key, then search the identified file using binary search on its associated index.
In any step, if the key is found, return the value. If a tombstone is found, return that the key does not exist. If the key is not found in the log or any table, return that the key does not exist.
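The search order can be sketched as follows. The in-memory stand-ins are my assumptions: the log is a list of appends (newest last), each level 0 SST is modeled as a dict, and each deeper-level SST is a (lo, hi, table) triple with disjoint ranges per level.

```python
TOMBSTONE = object()   # hypothetical deletion marker

def read(key, log, level0, levels):
    """Search order: log, then every level-0 SST newest first, then at most
    one SST per deeper level. A tombstone anywhere masks older values."""
    # 1) scan the log file, newest entry first
    for k, v in reversed(log):
        if k == key:
            return None if v is TOMBSTONE else v
    # 2) search every level-0 SST, newest first (their ranges may overlap)
    for table in level0:
        if key in table:
            v = table[key]
            return None if v is TOMBSTONE else v
    # 3) deeper levels: ranges are disjoint, so at most one SST per level
    for level in levels:
        for lo, hi, table in level:          # in practice, a binary search on ranges
            if lo <= key <= hi and key in table:
                v = table[key]
                return None if v is TOMBSTONE else v
    return None
```

The early returns implement the masking described above: the first hit, whether value or tombstone, settles the query, so obsolete pairs at deeper levels are never consulted.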
The search time is therefore O(L + (M + I) log N) for N elements, with a log file of size L, I levels, and at most M level 0 SST’s.
For more information, see Ilya Grigorik’s page.
Bigtable is one of the most important data structures used by Google. It uses sharding to partition the set of key-value pairs into ranges by key, and distributes the shards to hosts, which store their ranges sorted in SST’s. A query to the table first identifies the host that stores the range including the query key; then the query is sent to that host, which searches among its SST’s using binary search on the indices associated with the hosted SST’s.
Range queries are supported. All key-value pairs with a key in a given range can be selected and returned.
The Bigtable structure is a precursor of LevelDB, and consequently LevelDB mimics it. The table is broken into smaller tablets; each tablet handles a range of keys. The tablets are distributed over hosts. A metadata file, comprising a B-tree of tablet ranges, is first queried to find the tablet. The metadata file uses Chubby, a strongly consistent lock service based on Paxos, to keep a single metadata view.
Each tablet is created as the logical merge of its SST’s and a memtable. The memtable keeps the most recent changes, and is in memory. A minor compaction closes the memtable and transfers it to an SST. To limit the width of the merge among SST’s, periodic merging compactions replace a collection of SST’s by a single SST, retiring the SST’s so replaced. A major compaction is a merging compaction that merges all SST’s in a tablet.
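The logical merge and the three compactions can be sketched in miniature, with each SST modeled as an in-memory dict; this is my illustrative simplification, not the Bigtable implementation.

```python
class Tablet:
    """Sketch of a tablet as the logical merge of a memtable and its SST's."""

    def __init__(self):
        self.memtable = {}      # most recent changes, in memory
        self.ssts = []          # frozen SST's, newest first

    def write(self, key, value):
        self.memtable[key] = value

    def read(self, key):
        # The logical merge: memtable first, then SST's newest to oldest.
        for table in [self.memtable] + self.ssts:
            if key in table:
                return table[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable into a new (newest) SST.
        self.ssts.insert(0, self.memtable)
        self.memtable = {}

    def merging_compaction(self, n):
        # Replace the n oldest SST's by their single merged SST.
        old = self.ssts[-n:]
        merged = {}
        for table in reversed(old):     # apply oldest first, so newer wins
            merged.update(table)
        self.ssts = self.ssts[:-n] + [merged]

    def major_compaction(self):
        # A merging compaction over every SST in the tablet.
        self.merging_compaction(len(self.ssts))
```

The width of the merge in `read` is the number of SST’s plus one, which is exactly what the merging compactions keep bounded.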
See Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, all at Google, OSDI 2006.