The book ‘Design Data Intensive Application’ is a really popular book recently years. Why? The reason is that it is relatively ‘easy’ for the inexperience application software engineers to follow and get a glance of what distributed world looks like. This is my second time to read this book. The first time reading was kind of a casual glance through and I gave up halfway. After I reflected on my learning strategy, I come up with a plan for me to read and digest this book.
I will share the learning notes in a series of blogs. In each post, I will start with a brief Chapter Summary, followed up a Mind Map that summarizes the main topics and its relations. The topics in the mind map are attached with a list of questions what can be answered by the knowledge in the book. The questions and answers are listed in plain text in the Questions Summary section. In the Highlighted Topics section, a few interesting topics in the chapter are picked up to explain in details. In the last Extended Topics section, one or two topics that is not covered in detail in the book is briefly discussed and explored.
This is the second post of the series.
This chapter briefly discussed the three types of data models: relational, document and graph database. Some histories about database are introduced to give us a better ideas why these databases are developed and what types of problems they solve.
Personally I think this chapter is loosely structured and too much detail about the query languages is provided. Advice given to anyone who read this chapter: do not spend much time in understanding the query language syntax here, because the information given in the book will not help anyone to master the particular language (if you want to get into one query language, follow a well designed tutorial a specific book). Instead, you should try to understand the philosophy of each type of data model and identify the scenarios to apply them.
1. What is ORM frameworks?
Object-relational mapping framework, e.g. Hibernate
2. What is data locality?
Data locality is the process of moving computation to the node where that data resides, rather than moving data to save bandwidth. This helps to minimize network congestion and improve computation throughput.
In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time.
3. What is RDBMS?
relational database management system
4. What are three types of ways to store one-to-many relationship in database?
- normalized representation of the data: to store the data to multiple tables and use foreign key reference to link the data from one source to multiple others
- store the data in the predefined data structure that supported by the data store engine, e.g. json format supported by MySQL or PostgreSQL. The data is being able to be indexed and queried.
- store the data as pure text field and allow the application to interpret the data. In this case, the data will not be able to be indexed.
5. What are the typical use cases of graph-like data models?
- social media
- web graph
- road or rail networks
- recommendation engine
6. What are the difference between JPA and Hibernate and Spring Data JPA?
JPA (Java persistence API, nothing to do with Spring) is an interface (specification).
Hibernate is a JPA provider or the implementation.
Spring Data JPA is either an interface nor implementation, it is an abstraction used to reduce the amount of boilerplate code required to implement data access layers. Spring Data JPA cannot work without a JPA provider.
7. What are schema-on-write and schema-on-read?
Schema-on-write refers to the data systems that requires the explicit definition of data schema. When data is write to the database, it will be verified against the defined schema, any content that does not conforms to the schema will not be allowed to write.
Schema-on-read refers to the data systems that does not require the explicit definition of data schema. The data schema will be interpreted from the data when it is read.
8. What are NoSQL databases? What are the different types of NoSQL databases?
NoSQL database is a non-relational DBMS that does not require a fixed schema, avoid joins and easy to scale. There are four types of NoSQL database: Document databases (MongoDB), key-value stores (DynamoDB, Redis), wide-column database (Bigtable, Cassandra, Hbase) and Graph database (Neo4j).
9. In a property graph database model, what are the two data concepts? And what does each of them consists of?
node (vertex) and relationship (edge)
- node consists of labels and properties
- relationship consists of a source node, a target node, a type and properties
10. What does ployglot persistence mean?
When design an application data storage, choose multiple types of storage based on the different use cases of the components of the application.
11. What is impedance mismatch?
This is the term used to refer to the problems that object-oriented programming language’s data model has a mismatch of the SQL database model, that an awkward translation is required between them.
ORM frameworks like Hibernate reduce the amount of boilerplate code for this translation layer.
Property Graph model (Neo4j and Cypher) and Triple-store model (SPARQL) are two classic graph data models that is widely used. Here’s a brief summary and comparison of these two data model.
I had some hands on with Neo4j beginner’s tutorials, however I felt little I can learn and master without a practical project to work on. Unfortunately I have not got a chance to use graph database in production level projects yet. It is just valuable additions to my toolbox for building applications in the future.