The book ‘Design Data Intensive Application’ is a really popular book recently years. Why? The reason is that it is relatively ‘easy’ for the inexperience application software engineers to follow and get a glance of what distributed world looks like. This is my second time to read this book. The first time reading was kind of a casual glance through and I gave up halfway. After I reflected on my learning strategy, I come up with a plan for me to read and digest this book.

I will share the learning notes in a series of blogs. In each post, I will start with a brief Chapter Summary, followed up a Mind Map that summarizes the main topics and its relations. The topics in the mind map are attached with a list of questions what can be answered by the knowledge in the book. The questions and answers are listed in plain text in the Questions Summary section. In the Highlighted Topics section, a few interesting topics in the chapter are picked up to explain in details. In the last Extended Topics section, one or two topics that is not covered in detail in the book is briefly discussed and explored.

This is the first post of the series.

Chapter Summary

This is the first chapter of the book. It explains the fundamental concepts of application design. In designing an application, we need to consider two aspects - functional requirement and non-functional requirement. Functional requirements vary case by case, however non-functional requirements can be generalized. In this Chapter, three most commonly discussed non-functional requirements of data application are explained in detail - reliability, scalability and maintainability.

Mind Map

Questions Summary

1. What does scalability of software mean?

The ability of the application to be able to handle increasing load without downgrading performance

2. What does reliability mean for software? List four main descriptions

Generally means that the software is continuing to work correctly, even when things go wrong.

3. What does an ‘elastic’ system mean?

A system that is able to automatically scale up or down by detecting the load change

4. What does ‘tail latency’ mean and why is it important?

Tail latency refer to the high percentile response time (p95, p99 and p999). It is important as they affect the users’ experience of service, most often the more valuable customers’ experience as they are more likely to experience higher latency with large data volume or high throughput

5. What does ‘fan-out’ mean in the software application context?

In the transaction processing system, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request. (This is a term borrowed from electronic engineering, where it describes the number of logic gates inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs)

6. What are the three major types of errors that cause reliability issues of a software?
7. What are the three main design principles for a maintainable software?
8. What are the examples of new tools that do not fit exactly into the traditional data systems (databases, queues, caches)
9. What are the common functionalities of a data-intensive application?
10. What are some of the commonly used load parameters when describing the load of an application?
11. How to define the “percentiles” of response time?

Highlighted Topics

Twitter’s timeline fan-out architecture

A classic architecture design problem is discussed in this chapter to demonstrate how an architecture decision should be made based on the load pattern of a service. The example is the Twitter’s home timeline service.

There’s two most commonly used operations by twitter users - post tweet and load the home timeline. The load to these two operations are (in 2012) are 4.6k qps (12k peak) for post write and 300k qps for home timeline read. From the data we can observe that the read operation has two orders of magnitude higher than the write. There are two approaches to support the home timeline read:

1. Query on read

This approach queries the home timeline post when read request is received. The querying logic is represented in the relational database tables as shown below.

2. Cache on write

Another approach is to cache the home timeline on write. In this architecture, the system maintains a home timeline cache for each user which works like a mailbox. When a user post a tweet, the system will fan out the tweet (typical a tweet id) to the cache of all the followers of this user. In this case, the heavy computation is shifted to the write operation, whereas the read operation is just a simple query to the cache.

Since the application is ready-heavy, it is better to do more work on write and less on read. In this case, approach 2 is preferable. However the approach 2 has its own constraint that in cases that a user with large number of followers (e.g. celebrities), a single tweet post write will generates a massive number of following writes to the cache, results a huge spike of load to the cache. So twitter had worked on approaches that combines approach 1 and approach 2 to achieve optimized result.

There’s some more interesting topics covered in the infoQ talk Timelines at Scale, where Raffi Krikorian explains design trade offs in time timeline fan-out architecture, compared it with the search architecture and the possibility of merge them to support challenging use cases. See some notes I summarized from the talk in this section.

Percentiles of response times

Percentiles is a very important concept in describing the performance of a real-time service. Since the response time of a real-time service varies due to a lot of random parameters (disk mechanical vibration, network vibration, GC and software bugs), it is not possible to user a single number to describe the performance of such a service. In this case, a percentiles is a very useful term - sort the response time from fastest to slowest, the median in the sorted list is the 50th percentile (p50), similar, the p95, p99 and p999 are the 95th, 99th and 99.9th positioned data in the list. This number tells us how well the end users’ experience is.

In reality, tail percentiles (p95, p99, p999) should be seriously concerned. Because normally the clients with large data volume or higher throughput are more likely to experience higher latency, while these clients are those that are more valuable to business. In production practice, we are often monitoring and alerting on p99 or p999 latency based on how we define our SLA and SLO.

Extended Topics

More on Twitter’s architecture

Notes from Raffi Krikorian’s infoQ talk Timelines at Scale

1. How timeline is supported

See the simple architecture diagram shared in the talk

2. How search is supported

Fan-out and search are two paths that are behave opposite to each other, as shown

To solve the expensive fan-out cost of high value users (users has massive number of followers), the experiment being done is to stop fanning out the tweets from those users, instead fanning out only the tail users. When querying home timeline of a user, the user’s home timeline data will be merged with the high value user’s user timeline and delivered.