[DDIA] Chapter 3 - Storage and Retrieval

Chapter Summary

This is an interesting and exciting chapter! Finally the book starts to talk about data systems. The discussion begins with the illustration of how data store can be designed using a simple text file. Extending from the concept, a simple storage engine with append-only log is introduced - hash index. The basic knowledge about hash index makes the understanding of the subsequent classic LSM-tree storage engine easy. Complement to the LSM-tree storage engine, the classic B-tree storage engine is also explained in detail.

In the second section of the chapter, the difference between OLAP and OLTP databases is discussed. Diving deeper into the OLAP and data warehouse concept, column-oriented data storage engine is introduced.

Mind Map

Questions Summary

💡 How does Hash Indexes work?

Key-value stores are used (similar to the hashmap data structure). The key is the index key and the value is byte offset of the value in the data file. The hash index is stored in memory for fast access. When reading the data from the index, we can read the byte offset of the key from the index, and we only need to load the value data from the disk with just one disk seek. The data file is append only and compaction and segment merging are performed to reduce the storage

💡 What is compaction (of a key-value update log)?

For a log-structured database, compaction on the log file segments means throwing away duplicated keys in the log and keeping only the most recent update for each key.

💡 How to handle partially written records in hash index?

Use checksums to delete the corrupted parts of the log

💡 How to handle the data deleting in hash index?

Append a special deletion record to the data file (called tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous value for the deleted key

💡 What is checksums?

A checksum is an alphanumeric value that uniquely represents the content of a file. Checksums are often used to verify the integrity of files downloaded from external source.

💡 What are the limitations of hash index?

The hash table must fit in memory, it does not work for data that has a very large number of keys
Range queries are not support, you need to fetch data key one by one for a range of keys

💡 What are the two ways to solve the issue of a key-value index where the key is not unique?

Making each value stored with the keys a list of matching row identifier
By appending a row identifier to the key to make the key unique

💡 What is the reason that in-memory data storage performs faster than disk-based data storage?

The overhead resides mainly in the data encoding and decoding process as data stored in very different structure in memory and in disk. Surprisingly the read speed to memory and disk is not the deciding factor although they are in huge difference. (Anyway, the disk-based storage could avoid read from disk given enough memory provided and data cached)

💡 What is LSM tree?

Log-structured Merge-tree. Storage engines that are based on the principle of merging and compacting sorted files are often called LSM storage engines.

💡 How does the LSM-tree storage engine works (SSTable + Memtable)?

Write to the in-memory sorted balanced tree (red-black tree), also called a memtable
When the memtable size is larger than a threshold, write to the disk as SSTable. When copying is ongoing, the newly coming write will go to a new memtable
To process a read request, first check the memtable, if does not find, check the latest SSTable and so on.
On the background, a process will periodically performing merging and compaction on the SSTables to delete data and remove duplicates
Bloom filter is added as a first layer to check the existence of the key before it hits to the memtable and SSTables

💡 What is a memtable?

It is a in-memory cache of the latest set of record writes applied to the database. It is normally in a data structure (e.g. red-black tree) that supports log(n) time for insert and lookup. The memtable will be write to disk as an SSTable after the size passes some threshold.

💡 How does the LSM-tree storage engine improve the read efficiency for non-existing key?

Use a bloom filter to check if key exists before searching for the keys in memtable and SSTables

💡 How does a LSM-tree storage engine handle server crash?

The server crash will cause the lost of memtable. In this case, the service will keep a separate append-only log file in disk to record each operation write to the memtable. This is only used to recover the memtable from crash.

💡 How does a B-tree storage engine work?

B-trees break the database down into fixed size blocks or pages (usually 4kb in size) and read or write one page at a time
Each page can be identified using an address or location, which allows one page to refer to another, we can use these page reference to construct a tree of pages
When performing read, start from the root of the b-tree and look for the child page the key resides until reaching the leaf page
When performing write, search for the leaf key, then caching the value in that page and writing it back to disk after update. If there isn’t enough space accommodate the new key, it creates a new page and updates the parent’s pointer to point to the new page

💡 What is a heap file?

When the db index stores the value’s identifier, the actual data of the value is stored in a place called heap file. Often it stores data in no particular order. It is common because it avoids duplicating data when multiple secondary indexes are present.

💡 What is a clustered index?

It is refer to the type of index that the value stored directly with the key in the index, instead of an identifier of the value stored. This will ensure a better performance than the later. One example is that MySQL InnoDB storage engine, the primary key of a table is always a clustered index, and the secondary indexes refer to the primary key (not using heap file)

💡 Compare B-trees and LSM-trees?

B-tree is used to support page-oriented storage engine (used in MySQL, MongoDB etc). It stores the key and value (location to the value) and reference to other pages as small chucked pages in a tree-structure to allow efficient access to the keys
LSM-trees is used to support log-structured storage engines (e.g. LevelDB). It consists of a in-memory memtable and in-disk SSTables. The memtable provides fast write and read and keeps the key in sorted order. The data in memtable is periodically loaded to SSTable for persistence. Merging and compacting is performed on SSTables to remove duplications
B-tree enabled data storage is better suitable for read-heavy applications while LSM-trees database is more suitable for write-heavy applications. B-tree is more mature compared to LSM-trees.

💡 Why some wide-column database e.g. BigTable, Cassandra, Hbase are not column-oriented database?

These database are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. They introduces the column-family concept. Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression.

💡 What are OLTP and OLAP? What are the main differences?

Online transactional processing vs online analytical processing.

Read pattern: OLTP is for a small number of record per query, OLAP aggregate over large number of records
Write pattern: OLTP is for random access with low latency, OLAP is for bulk import or event stream
End users: OLTP for web app and end users, OLAP for data analytics and decision maker
Data representation: OLTP represents the latest state of the data, OLAP represents the historical events
Data size: OLTP - GB to TB, OLAP - TB to PB

Highlighted Topics

I find the discussion about the LSM-tree storage engine is really interesting and inspiring in this chapter. Here I use illustrations to describe how a simplest storage solution (a plain file) can be extend and evolved to an efficient and full-functional storage engine.

Simple Storage

A plain text file as a simple storage engine is illustrated below. The write operations (insert/update/delete) are always to append to the end of the log file. The read operation is always start from the beginning and scan to the end. This results to a simple structure, extremely terrible read performance and unbeatable efficient write performance.

Hash Index Storage

To improve the read efficiency, the concept of hash index is introduced. It is a in-memory hash map to store the key and the offset to the log file where the value is appended. Use the offset information, one disk seek is needed to fetch the value. Compaction and merging strategy is used to reduce the log file size resulting from the append-only write strategy.

LSM Tree storage

One problem of hash index storage engine is that the hash index has to fit into the memory. Memtable and SSTable are introduced in LSM-tree storage to solve the problem. Sorted key concept here is very important. It makes the searching and merging efficient with the sorted nature of the key. Bloom filter and write-ahead log (WAL) are some common technics to improve the performance and reliability.

Extended Topics

B-tree and B+tree

B-tree is the most widely used indexing structure. It was introduced in 1970 and have stood th test of time. It remains the standard index implementation in almost all relational database, and many nonrelational databases too (one of the popular - MongoDB). How B-tree works?

B-trees break the database down into fixed size blocks or pages (usually 4kb in size) and read or write one page at a time
Each page can be identified using an address or location, which allows one page to refer to another, we can use these page reference to construct a tree of pages
When performing read, start from the root of the b-tree and look for the child page the key resides until reaching the leaf page
When performing write, search for the leaf key, then caching the value in that page and writing it back to disk after update. If there isn’t enough space accommodate the new key, it creates a new page and updates the parent’s pointer to point to the new page

To better visualize:

What is B+tree and what is the advantage of that?

B+tree has almost the same structure as B-tree, but it has the leave nodes linked to each with pointers. This gives a huge advantage in range-search since we could traverse through the leave nodes directly instead of doing a backtracking search for each leave node within the search range.