The book ‘Designing Data-Intensive Applications’ has been really popular in recent years. Why? Because it is relatively ‘easy’ for inexperienced application software engineers to follow and get a glimpse of what the distributed world looks like. This is my second time reading this book. The first time was a casual glance-through, and I gave up halfway. After reflecting on my learning strategy, I came up with a plan to read and digest this book.

I will share the learning notes in a series of blog posts. In each post, I will start with a brief Chapter Summary, followed by a Mind Map that summarizes the main topics and their relations. The topics in the mind map are attached with a list of questions that can be answered by the knowledge in the book. The questions and answers are listed in plain text in the Questions Summary section. In the Highlighted Topics section, a few interesting topics from the chapter are picked to explain in detail. In the last Extended Topics section, one or two topics that are not covered in detail in the book are briefly discussed and explored.

This is the fifth post of the series.

Chapter Summary

This chapter is the start of three related topics regarding storage systems - replication, partitioning and transactions. Replication is important because of the properties it provides: 1) it keeps data geographically close to the users, 2) it allows the system to continue working even if some of its parts have failed (redundancy), and 3) it allows scaling out to serve increasing throughput.

Three common patterns, namely single-leader, multi-leader and leaderless, are discussed. In the single-leader pattern, the challenges are mainly follower set-up and replication lag. For the multi-leader and leaderless patterns, the main challenge is write conflict detection and resolution. The chapter discusses the definitions of the patterns, the challenges and various solutions in detail.

Mind map

Question Summary

Single-leader

What are synchronous replication and asynchronous replication?

Synchronous replication means that when a write happens, the successful response to the client is only sent after the data has been copied to all the followers. Asynchronous replication means that the successful response is returned to the client as soon as the write is saved on the leader; the copying from the leader to the followers happens asynchronously in the background.

What is semi-synchronous replication and why is it good?

In practice, synchronous replication is impractical because one unavailable node would block all writes. Semi-synchronous replication means that a write to the leader is synchronously copied to one follower, and then copied to the other followers asynchronously. This ensures that at any time, at least two nodes have an up-to-date copy of the data.
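
As a rough sketch of how this could look (the classes and method names below are hypothetical, not from any real database), the leader waits for one designated follower before acknowledging the client and replicates to the rest in the background:

```python
# A rough sketch of semi-synchronous replication (hypothetical classes, not
# any real database): the leader waits for one designated follower before
# acknowledging the client, and replicates to the rest in the background.
from concurrent.futures import ThreadPoolExecutor

class Follower:
    def __init__(self):
        self.log = []
    def apply(self, record):
        self.log.append(record)

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.sync_follower = sync_follower        # the one synchronous follower
        self.async_followers = async_followers    # the remaining followers
        self.pool = ThreadPoolExecutor()
        self.log = []

    def write(self, record):
        self.log.append(record)                   # 1. persist on the leader
        self.sync_follower.apply(record)          # 2. wait for one follower
        for f in self.async_followers:            # 3. fire-and-forget to the rest
            self.pool.submit(f.apply, record)
        return "ok"                               # 4. acknowledge the client

leader = Leader(Follower(), [Follower(), Follower()])
print(leader.write({"key": "x", "value": 1}))     # "ok" - two nodes now hold the data
```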

What is the trade-off of async replication?

Weak durability. If a leader fails before all of its data has been copied to the followers, data will be lost.

What are the 4 steps to set up a new follower?
What is the log sequence number in PostgreSQL and binlog coordinate in MySQL?

These two are the same concept: the exact position in the leader’s replication log that the snapshot of the database is associated with.

What are the 3 steps for an automatic failover process?

  1. Determine that the leader has failed, normally using a ping timeout
  2. Choose a new leader through an election process
  3. Reconfigure the system to use the new leader
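
A toy outline of those three steps in code (the fields, the `log_position` heuristic and the "election" below are all hypothetical and gloss over the hard parts, such as consensus):

```python
# A toy outline of the three failover steps: detect via ping timeout, elect a
# new leader, reconfigure the other nodes. Real systems need a consensus
# protocol for the election step.
import time

def leader_has_failed(last_heartbeat, timeout=30):
    return time.time() - last_heartbeat > timeout          # step 1: detection via timeout

def elect_new_leader(followers):
    # Step 2: pick the follower with the most up-to-date replication log.
    return max(followers, key=lambda f: f["log_position"])

def reconfigure(nodes, new_leader):
    # Step 3: point every node (and future client writes) at the new leader.
    for node in nodes:
        node["leader"] = new_leader["name"]

followers = [{"name": "f1", "log_position": 120},
             {"name": "f2", "log_position": 118}]
if leader_has_failed(last_heartbeat=time.time() - 60):
    new_leader = elect_new_leader(followers)
    reconfigure(followers, new_leader)
    print("new leader:", new_leader["name"])               # new leader: f1
```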

Why is manual operation preferred for handling leader failover?
What is split brain?

Split brain means more than one leader is running in a single-leader replication architecture. If there is no process to handle write conflicts, it will result in data loss or corruption.

What are the four leader-to-follower replication methods?
In a statement-based replication approach, what conditions may break the replication?
  1. Statements that use a nondeterministic function (NOW(), RAND()) - see the sketch below
  2. Statements that use an auto-incrementing column
  3. Statements that have side effects (e.g. triggers, stored procedures, user-defined functions)
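
The first condition can be illustrated with a toy simulation (this is not real MySQL behaviour, just a sketch of the idea): replaying the "same" statement on the leader and on a follower evaluates RAND() twice, and the replicas diverge.

```python
# A toy simulation of condition 1: the leader and a follower each evaluate
# RAND() when replaying the "same" statement, so their data diverges.
import random

def apply_statement(db, statement):
    # The hypothetical statement inserts one row containing RAND().
    if statement == "INSERT INTO t VALUES (RAND())":
        db.append(random.random())

leader_db, follower_db = [], []
stmt = "INSERT INTO t VALUES (RAND())"
apply_statement(leader_db, stmt)      # evaluated once on the leader...
apply_statement(follower_db, stmt)    # ...and evaluated again on the follower
print(leader_db == follower_db)       # almost certainly False: replicas diverged
```
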
What is the disadvantage of the write-ahead log shipping replication approach?

The log data is stored at a very low level, which makes the replication closely coupled to the storage engine. Normally the log cannot be consumed by a different version of the engine, which makes zero-downtime version upgrades challenging.

What are the different ways to solve the replication lag issue?

Multi-leader

What are the ways (name 4) of achieving convergent conflict resolution?
What are the commonly seen multi-leader replication topologies?

Circular topology, star topology and all-to-all topology

Why is the all-to-all topology better than the circular and star topologies?

It avoids a single point of failure.

What is a replication topology in a multi-leader replication architecture?

A replication topology describes the communication paths along which writes are propagated from one node to another.

What is the causality issue of all-to-all topology and what is the solution to that?

The causality issue refers to the problem that some operations do not arrive at the system in the expected order due to network latency, so an operation that depends on another might arrive earlier (e.g. an update statement arrives before the insert of the item). Version vectors are the solution to the causality issue.
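
A toy illustration of the problem (hypothetical operations, not tied to any particular database): on one replica the dependent update arrives, over a slower link, before the insert it depends on.

```python
# Apply operations to a toy key-value replica; an update only makes sense
# after the insert it depends on has been applied.
def apply(db, op):
    kind, key, value = op
    if kind == "insert":
        db[key] = value
    elif kind == "update":
        if key not in db:
            raise KeyError(f"update for {key!r} arrived before its insert")
        db[key] = value

ops_in_causal_order = [("insert", "msg1", "hello"), ("update", "msg1", "hello!")]
replica = {}
try:
    for op in reversed(ops_in_causal_order):   # the network delivered them out of order
        apply(replica, op)
except KeyError as err:
    print("causality violated:", err)
```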

Leaderless

How does an unavailable node catch up on the writes when it comes back online?
What is the benefit of choosing r and w to be majority (more than n/2) of nodes?

This choice ensures w + r > n (the quorum condition) while still tolerating up to n/2 node failures.

If the workload pattern is few writes and many reads, what is a better configuration for the quorum and why?

Reduce r, say to 1, and increase w, say to n (so that r + w > n still holds). In this case we reduce the load of reads. However, increasing w to n has the disadvantage that a single unavailable node will cause writes to fail.

Define strict quorum reads and writes

Let w be the number of nodes that must confirm a write, r the number of nodes required for a read, and n the total number of nodes. When w + r > n, it is guaranteed that at least one of the nodes used for a read contains the latest write.
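
A small helper (hypothetical, not part of any database client) makes it easy to sanity-check the configurations from the two questions above against the condition w + r > n:

```python
# Check quorum configurations against the strict quorum condition w + r > n.
def is_strict_quorum(n, w, r):
    return w + r > n

print(is_strict_quorum(n=5, w=3, r=3))   # True - majority quorums, tolerates 2 down nodes
print(is_strict_quorum(n=5, w=5, r=1))   # True - cheap reads, but one down node blocks writes
print(is_strict_quorum(n=5, w=3, r=2))   # False - a read may miss the latest write
```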

Define sloppy quorum

In a large cluster with significantly more than n nodes, writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value.

Highlighted Topics

Quorum condition

The leaderless architecture became popular after Amazon used it for its in-house Dynamo system (the paper was published in 2007). Riak, Cassandra and Voldemort are open-source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as Dynamo-style.

The quorum is the essential concept for guaranteeing data consistency in Dynamo-style data stores. It is defined as follows:

If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes.

ddia-quorum.excalidraw

We can smartly choose w and r in different use cases:

However, even with w + r > n, there are likely to be edge cases where stale values are returned. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency (BASE vs ACID).
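
To make the quoted definition concrete, here is a minimal sketch (hypothetical classes and functions, not any real client library) of Dynamo-style quorum reads and writes over n = 3 in-memory replicas. The write is acknowledged by the first w replicas and the read deliberately queries the last r replicas, so the two sets overlap only because w + r > n.

```python
# A minimal sketch of quorum reads/writes over in-memory replicas.
# Each replica stores (version, value) per key; the read returns the
# freshest value among the r responses.
class Replica:
    def __init__(self):
        self.store = {}                       # key -> (version, value)
    def put(self, key, version, value):
        self.store[key] = (version, value)
    def get(self, key):
        return self.store.get(key, (0, None))

def quorum_write(replicas, w, key, version, value):
    # A real system sends the write to all n replicas; here we stop after
    # w acknowledgements to show which nodes are guaranteed to have the data.
    for rep in replicas[:w]:
        rep.put(key, version, value)

def quorum_read(replicas, r, key):
    # Query the "other end" of the replica list: the overlap with the write
    # set is only guaranteed when w + r > n.
    responses = [rep.get(key) for rep in replicas[-r:]]
    return max(responses)                     # highest version wins

replicas = [Replica() for _ in range(3)]      # n = 3
quorum_write(replicas, w=2, key="x", version=1, value="a")
print(quorum_read(replicas, r=2, key="x"))    # (1, 'a'): w + r = 4 > 3
print(quorum_read(replicas, r=1, key="x"))    # (0, None): w + r = 3, stale read
```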

Extended Topics

version vector

Consider two operations A and B that happen at nearly the same time; there are only three possibilities in terms of which happens first:

If we can identify the causality between A and B, then we can say that A and B are not concurrent and we can safely keep the result from the most recent operation. Otherwise, we want to keep the conflicting results from the concurrent operations A and B and let the client resolve the conflict (in simpler cases, we can take the “later” operation based on the timestamp or replica-number order, but that comes with the risk of data loss).

To detect concurrency, we can take the approach of detecting causality: if there is no causality, the operations are concurrent.

A simple algorithm to capture the happens-before relationship
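
Below is a hedged sketch of that algorithm for a single key on a single replica (the class name is hypothetical, and real databases add sibling merging and tombstones): the server keeps a version counter, clients echo back the version they last read, and the server overwrites only the values that version has already "seen", keeping the rest as concurrent siblings.

```python
# Sketch of happens-before detection for one key on one replica: the server
# keeps a version counter, and a write overwrites only the siblings whose
# version is <= the version the client last read.
class SiblingStore:
    def __init__(self):
        self.version = 0
        self.siblings = {}                               # version -> value

    def get(self):
        # Return all current siblings plus the version to echo back on write.
        return self.version, list(self.siblings.values())

    def put(self, value, based_on_version):
        # Drop every sibling the client had already seen (version <= based_on_version)...
        self.siblings = {v: val for v, val in self.siblings.items()
                         if v > based_on_version}
        # ...then store the new value under a fresh version number.
        self.version += 1
        self.siblings[self.version] = value
        return self.version

store = SiblingStore()
store.put(["milk"], based_on_version=0)                  # client A writes
v = store.put(["eggs"], based_on_version=0)              # client B writes concurrently
print(store.get())                                       # (2, [['milk'], ['eggs']]): siblings kept
store.put(["milk", "eggs", "ham"], based_on_version=v)   # B merges the siblings and writes back
print(store.get())                                       # (3, [['milk', 'eggs', 'ham']])
```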

An important aspect to consider when reasoning about the scalability of these version vectors is the existence of three different orders of magnitude at play:

Here’s a summary and comparison of different version vector approaches

Version vector with causal histories

Causal histories (e.g. {a1, a2}) are maintained in this version vector solution, but this is not useful in practical systems because the histories scale linearly with the number of updates.

version-vector_1.excalidraw
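
A tiny sketch of how comparison works with full causal histories (sets like {a1, a2} above): one version happened before another exactly when its history is a strict subset of the other's; if neither contains the other, the versions are concurrent.

```python
# Compare two versions by their full causal histories.
def compare(hist_x, hist_y):
    if hist_x < hist_y:                       # strict subset
        return "x happened before y"
    if hist_y < hist_x:
        return "y happened before x"
    return "concurrent"

print(compare({"a1"}, {"a1", "a2"}))          # x happened before y
print(compare({"a1", "b1"}, {"a1", "a2"}))    # concurrent
```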

Causally compliant total order (LWW)

Assuming that client clocks are well synchronised and applying real-time clock order, replica nodes in this approach never store multiple versions and writes do not need to provide a get context. Cassandra uses this simplest approach.

version-vector-2.excalidraw
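
A minimal sketch of the last-write-wins merge under that assumption (the values are made up, and plain numbers stand in for wall-clock timestamps):

```python
# Last-write-wins: each write carries a timestamp and the replica keeps only
# the newest one.
def lww_merge(current, incoming):
    # current / incoming are (timestamp, value) pairs; ties broken arbitrarily.
    return incoming if incoming[0] > current[0] else current

print(lww_merge((100, "cart: milk"), (105, "cart: eggs")))   # (105, 'cart: eggs')
# Note the risk mentioned earlier: the write at timestamp 100 is silently lost.
```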

Version vectors with per-server entry

A concise version vector can be achieved by storing the up-to-date server+version information with each replica. In this case, it is possible to track causality among updates that were received by different servers. However, this approach cannot track causality among updates submitted to the same server. The Dynamo system uses one entry per replica node and thus falls into this category.

version-vector-3.excalidraw
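
A hedged sketch of the per-server comparison (the dictionary keys stand in for replica node ids): version vectors are compared pointwise, and if neither dominates the other, the two versions are concurrent and both must be kept as siblings.

```python
# Per-server version vectors: one counter per replica node, compared pointwise.
def dominates(vv_a, vv_b):
    keys = set(vv_a) | set(vv_b)
    return vv_a != vv_b and all(vv_a.get(k, 0) >= vv_b.get(k, 0) for k in keys)

def merge(vv_a, vv_b):
    keys = set(vv_a) | set(vv_b)
    return {k: max(vv_a.get(k, 0), vv_b.get(k, 0)) for k in keys}

a = {"replica_x": 2, "replica_y": 1}
b = {"replica_x": 1, "replica_y": 1}
c = {"replica_x": 1, "replica_y": 2}
print(dominates(a, b))                     # True: a happened after b
print(dominates(a, c), dominates(c, a))    # False False: a and c are concurrent
print(merge(a, c))                         # {'replica_x': 2, 'replica_y': 2} (key order may vary)
```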

Version vector with per-client entry

Version vectors with one entry per replica node are not enough to track causality among concurrent clients. One natural approach is to track causality by using version vectors with one entry per client. In the implementation, each client needs to maintain a counter, increment it and provide it in each PUT; otherwise we will encounter the problem illustrated below. This approach is not very practical, mainly because it is a big challenge for the storage system to keep track of a stable set of clients - applications can run concurrent and dynamic threads, which are difficult to identify and track.

version-vector-4.excalidraw

Dotted version vectors

Dotted version vectors use a concise and accurate representation of the clocks as a substitute for classic version vectors. They use only server-based ids and only one component per replica node, thus avoiding the explosion in space consumption. The formal definition of the version vector is

\[\begin{align*} &C[(r, m)] = \{r_i | 1 \leq i \leq m\}, \\ \\ &C[(r,m,n)] = \{r_i| 1 \leq i \leq m\} \cup \{r_n\}, \\ \\ &C[X] = \bigcup_{x\in X}C[x] \\ \\ &\text{In a component (r,m,n) we will always have n > m} \end{align*}\]
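
The definition can be transcribed almost literally into code (a sketch only, not a production dotted-version-vector implementation); events r_i are written as strings like "a1", "a2".

```python
# Expand dotted-version-vector components into the set of events they cover.
def expand_pair(r, m):
    # C[(r, m)] = {r_1, ..., r_m}
    return {f"{r}{i}" for i in range(1, m + 1)}

def expand_triple(r, m, n):
    # C[(r, m, n)] = {r_1, ..., r_m} ∪ {r_n}, with the isolated "dot" r_n (n > m)
    return expand_pair(r, m) | {f"{r}{n}"}

def causal_history(components):
    # C[X] = union of C[x] over all components x in the clock X
    out = set()
    for comp in components:
        out |= expand_pair(*comp) if len(comp) == 2 else expand_triple(*comp)
    return out

print(causal_history([("a", 2), ("b", 1, 3)]))   # {'a1', 'a2', 'b1', 'b3'}
```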

The example shows how the version vectors are updated for each update:

version-vector-5.excalidraw

As we see, popular databases like DynamoDB and Cassandra use simpler versions of the version vectors in practice. This is very common: in reality, production-level engineering systems make compromises for performance and engineering simplicity.

Further Reading
