The book ‘Design Data Intensive Application’ is a really popular book recently years. Why? The reason is that it is relatively ‘easy’ for the inexperience application software engineers to follow and get a glance of what distributed world looks like. This is my second time to read this book. The first time reading was kind of a casual glance through and I gave up halfway. After I reflected on my learning strategy, I come up with a plan for me to read and digest this book.
I will share the learning notes in a series of blogs. In each post, I will start with a brief Chapter Summary, followed up a Mind Map that summarizes the main topics and its relations. The topics in the mind map are attached with a list of questions what can be answered by the knowledge in the book. The questions and answers are listed in plain text in the Questions Summary section. In the Highlighted Topics section, a few interesting topics in the chapter are picked up to explain in details. In the last Extended Topics section, one or two topics that is not covered in detail in the book is briefly discussed and explored.
This is the third post of the series.
This is an interesting and exciting chapter! Finally it starts to talk about data systems. The discussion starts from the illustration of how data store can be done with a simple text file. Extending from the concept, a simple storage engine with appending-only log is introduced - hash index. The basic knowledge makes the understanding of the classic LSM-tree storage engine easy. Complement to the LSM-tree storage engine, another classic B-tree storage engine is discussed in detail.
In the second section of the chapter, the difference between OLAP and OLTP databases is discussed. Diving deeper into the OLAP and data warehouse concept, column-oriented data storage engine is introduced.
1. How does Hash Indexes work?
Key-value stores are used (similar to the hashmap data structure). The key is the index key and the value is byte offset of the value in the data file. The hash index is stored in memory for fast access. When reading the data from the index, we can read the byte offset of the key from the index, and we only need to load the value data from the disk with just one disk seek. The data file is append only and compaction and segment merging are performed to reduce the storage
2. What is compaction (of a key-value update log)?
For a log-structure database, compaction on the log file segments means throwing away duplicate keys in the log and keeping only the most recent update for each key.
3. How to handle partially written records in hash index?
Use checksums to delete the corrupted parts of the log
4. How to handle the data deleting in hash index?
Append a special deletion record to the data file (called tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous value for the deleted key
5. What is checksums?
A checksum is an alphanumeric value that uniquely represents the contents of a file. Checksums are often used to verify the integrity of files downloaded from external source.
6. What are the limitations of hash index?
- the hash table must fit in memory, it does not work for data that has a very large number of keys
- range queries are not support, you need to fetch data key one by one for a range of keys
7. What are the two ways to solve the issue of a key-value index where the key is not unique?
- making each value stored with the keys a list of matching row identifier
- by appending a row identifier to the key to make the key unique
8. What is the reason that in-memory data storage performs faster than disk-based data storage?
The overhead resides mainly in the data encoding and decoding process as data stored in very different structure in memory and in disk. Surprisingly the read speed to memory and disk is not the deciding factor although they are in huge difference. (Anyway, the disk-based storage could avoid read from disk given enough memory provided and data cached)
9. What is LSM tree?
Log-structured Merge-tree. Storage engines that are based on the principle of merging and compacting sorted files are often called LSM storage engines.
10. How does the LSM-tree storage engine works (SSTable + Memtable)?
- write to the in-memory sorted balanced tree (red-black tree), also called a memtable
- when the memtable size is larger than a threshold, write to the disk as SSTable. When copying is ongoing, the newly coming write will go to a new memtable
- to process a read request, first check the memtable, if does not find, check the latest SSTable and so on.
- on the background, a process will periodically performing merging and compaction on the SSTables to delete data and remove duplicates
- bloom filter is added as a first layer to check the existence of the key before it hits to the memtable and SSTables
11. What is a memtable?
It is a in-memory cache of the latest set of record writes applied to the database. It is normally in a data structure (e.g. red-black tree) that supports log(n) time for insert and lookup. The memtable will be write to disk as an SSTable after the size passes some threshold.
12. How does the LSM-tree storage engine improve the read efficiency for non-existing key?
- Use a bloom filter to check if key exists before searching for the keys in memtable and SSTables
13. How does a LSM-tree storage engine handle server crash?
The server crash will cause the lost of memtable. In this case, the service will keep a separate append-only log file in disk to record each operation write to the memtable. This is only used to recover the memtable from crash.
14. How does a B-tree storage engine work?
- B-trees break the database down into fixed size blocks or pages (usually 4kb in size) and read or write one page at a time
- Each page can be identified using an address or location, which allows one page to refer to another, we can use these page reference to construct a tree of pages
- when performing read, start from the root of the b-tree and look for the child page the key resides until reaching the leaf page
- when performing write, search for the leaf key, then caching the value in that page and writing it back to disk after update. If there isn’t enough space accommodate the new key, it creates a new page and updates the parent’s pointer to point to the new page
15. What is a heap file?
When the db index stores the value’s identifier, the actual data of the value is stored in a place called heap file. Often it stores data in no particular order. It is common because it avoids duplicating data when multiple secondary indexes are present.
16. What is a clustered index?
It is refer to the type of index that the value stored directly with the key in the index, instead of an identifier of the value stored. This will ensure a better performance than the later. One example is that MySQL InnoDB storage engine, the primary key of a table is always a clustered index, and the secondary indexes refer to the primary key (not using heap file)
17. Compare B-trees and LSM-trees?
- B-tree is used to support page-oriented storage engine (used in MySQL, MongoDB etc). It stores the key and value (location to the value) and reference to other pages as small chucked pages in a tree-structure to allow efficient access to the keys
- LSM-trees is used to support log-structured storage engines (e.g. LevelDB). It consists of a in-memory memtable and in-disk SSTables. The memtable provides fast write and read and keeps the key in sorted order. The data in memtable is periodically loaded to SSTable for persistence. Merging and compacting is performed on SSTables to remove duplications
- B-tree enabled data storage is better suitable for read-heavy applications while LSM-trees database is more suitable for write-heavy applications. B-tree is more mature compared to LSM-trees.
18. Why some wide-column database e.g. BigTable, Cassandra, Hbase are not column-oriented database?
These database are not column stores in the original sense of the term, since their two-level structures do not use a columnar data layout. They introduces the column-family concept. Within each column family, they store all columns from a row together, along with a row key, and they do not use column compression.
19. What are OLTP and OLAP? What are the main differences?
Online transactional processing vs online analytical processing.
- read pattern: OLTP is for a small number of record per query, OLAP aggregate over large number of records
- write pattern: OLTP is for random access with low latency, OLAP is for bulk import or event stream
- end users: OLTP for web app and end users, OLAP for data analytics and decision maker
- data representation: OLTP represents the latest state of the data, OLAP represents the historical events
- data size: OLTP - GB to TB, OLAP - TB to PB
I find the discussion about the LSM-tree storage engine is really interesting and inspiring in this chapter. Here I use illustrations to describe how a simplest storage solution (a plain file) can be extend and evolved to an efficient and full-functional storage engine.
A plain text file as a simple storage engine is illustrated below. The write operations (insert/update/delete) is always append to the end of the log file. The read operation is always start from the beginning and scan to the end. This results to a simple structure, extremely terrible read performance and unbeatable efficient write performance.
Hash Index Storage
To improve the read efficiency, the concept of hash index is introduced. It is a in-memory hash map to store the key and the offset to the log file where the value is appended. Use the offset information, one disk seek is required to fetch the value. Compaction and merging strategy is used to reduce the log file size resulting from the append-only write strategy
LSM Tree storage
One of the problem of hash index storage engine is the hash index has to be fit into the memory. Memtable and SSTable are introduced in LSM-tree storage to solve the problem. Sorted key concept here is very important. It makes the searching and merging efficient with the sorted nature of the key. Bloom filter and write-ahead log are some common technics to improve the performance and reliability.
B-tree and B*-tree storage engines
TODO; just want to publish this post sooner to motivate myself to continue to write