[DDIA] Chapter 4 - Encoding and Evolution

Chapter Summary

Services communicate to each other by protocols (pre-defined interfaces and data models). As application developers, we are often making API design choices by choosing between HTTP or gPRC, or REST vs RPC vs GraphQL, or Web socket vs HTTP etc. The definitions of these terms are sometimes misunderstood a lot (at least by me for a long time). These terms are actually not mutual-exclusive.

This chapter provides some fundamental knowledge on data encoding (serialisation): different mechanisms of turning data structures into bytes for communicating over the network or storing on disk. The focus is on the compatibility properties of the different encoding strategies: how easily the backward and forward compatibility can each solution provide. At the later part of the chapter, how data encoding plays a role in the three common dataflows in application design namely data store, sync API communication and async message queue is discussed. Understanding the underlying mechanism of data encoding helps us to make better tech decisions when designing and implementing application softwares.

Mind Map

Question Summary

💡 Why we need to encode the data (translation between the in-memory data representation to a byte sequence that is sent over the network or sent to a file)?

The data stored in memory (normally as list, object, hash tables, trees) is designed for efficient access by the CPU. While the data transferred through the network or written to a file is in a format of self-contained sequence of bytes, which is quite different from the data structure used in memory.

💡 What are the problems of language specific encoding method? (e.g. java’s kryo, ruby’s marshal, python’s pickle)

They are language specified and have to tie with the chosen language
Most of the time the performance is not good enough
There’s no version control. Backward compatibility and forward compatibility is not easy to maintain
Security concerns: in order to restore the data in the same object type, the decoding process needs to be able to instantiate arbitrary class, which can be used by attackers to inject executable code

💡 What are the pros and cons of textual format encoding method? (e.g. xml, json, csv)

Pros:

Human readable, easy to understand and debug
Can be easily made to support backward and forward compatibility
Language-independent

Cons:

Verbose
Ambiguity around the encoding of numbers. Specific flaws for specific format. E.g. integer large than 2^53 cannot be parsed incorrectly by language uses floating-point numbers. JSON and XML does not support binary string
Weak support for schema definition
CSV does not have schema so there’s confusion to handle value with comma or newline

💡 What are the benefits of binary encoding?

Clearly defined backward and forward compatibility semantics
Efficient computation and result in compact size
Type and schema definition, used as good documentation and code generation for statically typed languages

💡 How do Thrift and Protocol Buffers handle schema changes while keeping backward and forward compatibility?

The field tag is used to refer to each field and it is never changed. There will be new fields with new field tags added. To maintain the forward compatibility, the old code will simply ignore the unrecognized field tag. For backward compatibility, the new code can always read old data because the tag numbers still have the same meaning. The only detail is that you cannot make the new field as required, as the old code will not have written the new field you added.

💡 What is the benefit of ‘repeat’ field type in Protocol Buffers?

The encoding of ‘repeat’ field type says that the same field tag simply appears multiple times in the record. This has the nice effect that it is okay to change an optional field into a repeated field. New code reading old data sees a list with zero or one elements while old code reading new data sees only the last element of the list.

💡 What are the three ways to communicate the writer’s schema to the reader when using Avro encoding?

When the dataset volume is large, it is ok to contain the writer’s schema in the file itself and sent.
When there’s upgrading of versions frequently and each entry of data might be written with different versions of schema. It is helpful to maintain a database of version number of the writer’s schema. Then send only the version number with the data to the reader. When the reader reads the data, it will retrieve the schema by the version number.
When two processes are communicating over a bidirectional network connection, they can negotiate the schema version on connection setup and then use the schema for the lifetime of the connection.

💡 How does Avro maintain the forward and backward compatibility?

Avro maintains two sets of schema - writer’s schema and reader’s schema. The schema resolution logic will check the compatibility and resolve the difference of the two. Adding or removing fields that has default value, we can maintain the compatibility.

💡 Considering the dataflow through database, in which scenarios that we need to handle the backward compatibility and the forward compatibility?

We need to consider both backward compatibility and forward compatibility. Backward compatibility needs to be handled by default, as the data written at a time, will be read at a later time. In the scenarios that a database is read and written by multiple instances, during rolling upgrade of the instances, the database may be written by a newer version of the code, and subsequently read by an older version of the code that is still running. Thus, forward compatibility is also often required for database.

💡 What is data outlives code?

Data in the database was written some time ago and might become incompatible with the newest version of code that reads and writes data.

💡 What can schema evolution bring to the database dataflow?

Schema evolution allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.

Highlighted Topics

Understand the four encoding algorithms with an example

Sample data

{
	"username": "Martin",
	"favoriteNumber": 1337,
	"interest": ["daydreaming", "hacking"]
}

MessagePack

MessagePack has no schema, it is a binary encoding for JSON. The schema (keys) needs to be encoded in the message. So, the binary encoding is 66 bytes for this example, which is not much size reduction achieved here (original json is 81 bytes). It is not clear whether such a small space reduction is worth the loss of human-readability.

Thrift and Protocol Buffers

Thrift and protocol buffers are similar in the sense that schema is defined and fixed with tag number. With the re-defined schema, the message size is significant reduced since only the schema tag is required in the message. An obvious difference between thrift and protocol buffers is how the array type data is defined: thrift support list data type while protocol buffers use repeated marker to indicate that a field can be a list. This has the nice effect that it’s okay to change an optional field into a repeated field. New code reading old data sees a list with zero or one elements; old code reading new data sees only the last element of the list.

Avro

Avro is a subproject of Hadoop. It does not contain field tag in the schema. To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field. For Avro, the writer and reader maintains its own copy of schema, Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.

Extended Topics

A better understanding of the following concepts.

REST vs RPC

REST: a philosophy that builds upon the principles of HTTP. It emphases on “resource-centric”. An API designed according to the principles of REST is called RESTful.

RPC: first we shall differentiate the definition of RPC used in different context. RPC (remote procedure calls), the idea has been around since 1970s. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process. Nowadays, we use RPC often to describe:

protocols

Remote Procedure Call (RPC) is a protocol that provides the high-level communication paradigm used in the operating system. RPC presumes the existence of a low-level transport protocol, such as Transmission Control Protocol/Internet Protocol (TCP/IP) or User Datagram Protocol (UDP), for carrying the message data between communicating programs. RPC implements a logical client-to-server communications system designed specifically for the support of network applications. When RPC is made, the procedure identifier and parameters serialized into a request message then sent to the remote process which serves the call. On the remote process the message got deserialized back, the process run, and the results returned in a similar way.

frameworks

such as gRPC, Apache Thrift, Apache Avro, Apache Dubbo

Philosophy of API pattern

The RPC API thinks in terms of “verbs”, exposing the restaurant functionality as function calls that accept parameters, and invokes these functions via the HTTP verb that seems most appropriate - a ‘get’ for a query, and so on, but the name of the verb is purely incidental and has no real bearing on the actual functionality, since you’re calling a different URL each time. Return codes are hand-coded, and part of the service contract.

When we compare REST and RPC, we could be comparing HTTP API design pattern - use RESTful and RPC-style design pattern (another pattern faded out is SOAP). Or we could be evaluating tech decision on protocols, whether we should use HTTP (json) or RPC (gRPC or other frameworks build on top of the protocols).

HTTP vs gRPC

Explained above, we are comparing the json encoding and protocol buffers encoding

GraphQL vs REST

GraphQL is an application layer server-side technology that is used for executing queries with existing data. It is deployed over HTTP using a single endpoint that provides the full capabilities of the exposed service. It promotes a difference API design philosophy from the REST, it requires strong typed schema definition and supports schema stitching.

WebSocket

WebSocket connection is initiated by the client. It is bi-directional and persistent. It starts its life as a HTTP connection and could be “upgraded” via some well-defined handshake to a WebSocket connection. Through this persistent connection, a server could send updates to a client. WebSocket connections generally work even if a firewall is in place. This is because they use port 80 or 443 which are also used by HTTP/HTTPS connections.

WebSocket is the default choice for the online messaging service. We could just skip the discussion of HTTP vs WebSocket in supporting peer-to-peer messaging and opt for WebSocket directly. It is not some fancy technology, as Keith Adams said in his slack presentation, it could be some 40 lines of java code wrapping a standard WebSocket library with a for loop over the people that are connected.