The Chapter 2 from “Designing Data-Intensive Applications” by Martin Kleppmann is covering concepts like data models and query languages.
Data models are having direct impact on how the software is written, but also define how we think about the problem we are working to solve.
There are multiple data models for data storage and querying and the most common used are relational models, document models and graph based data models.
Relational Model vs Document Model
The SQL data model was a theoretical proposal done by by Edgar Codd in 1970. In this model the data is organised intro relations which are called tables in SQL, and each relation is an unordered collection of tuples which are called rows in SQL.
The purpose of this relational model was to hide the implementation details behind a cleaner interface and proved to be present in the software world for more than 25 years. It dominated the network and hierarchical models which were the main alternatives in the 70s and 80s.
In the 2010s, NoSQL is another attempt to overthrow the relational model’s dominance. Its name is coming from a Twitter hashtag (later on reinterpreted as Not Only SQL) used for a meetup on open source, distributed, non-relational databases in 2009.
Why to adopt a NoSQL database:
- the need of having a greater scalability
- the preference for free and open source software over commercial database products
- specialised query operations that are not well supported by the relational model
- frustration with the restrictiveness of relational schemas and a desire for a more dynamic and expressive data model
Object relational impedance mismatch – this is actually a fancy way of saying: “Yes, it is difficult getting stuff into and out of the database”. The purpose of an ORM is to solve this issue or at least to make the things easier in translating the entities from database to objects (when working with OOP, but it can’t completely hide the differences between the two models.
The JSON model is seen by some developers as a model that reduces the impedance mismatch between the application code and the storage layer, but even so it is not perfect. A thing is sure, the JSON representation has better locality than the multi-table schema because if you want to fetch some data from a relational model you will need to join the info from multiple tables, but with the JSON you could store the data in one single place.
When it comes to representing many-to-one and many-to-many relationships, relational and document databases are not fundamentally different: in both cases, the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model . That identifier is resolved at read time by using a join or follow-up queries.
PRO arguments for document data model:
- schema flexibility – document databases are sometimes called schemaless, but a more accurate name will be schema-on-read because the structure of the data is implicit, and only interpreted when the data is read. Schema-on-read it is in contrast with schema-on-write which is the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it. Also the schema-on-read approach is very useful when the items from a collection don’t all have the same structure for some reason.
- better performance due to locality – a document is usually stored as a single continuous string, encoded as JSON, XML or a binary variant. It is also recommended to keep the documents fairly small and avoid writes that increase the size of the document. An aspect to mention here is that the idea of grouping related data together for locality is present not only in the document model: Google’s Spanner database offers the same locality properties in a relational data model, same for Oracle (multi-table index cluster tables), or Bigtable data model with the column-family concept used in Cassandra and HBase.
- for some apps it is closer to the data structures used by the app
PRO arguments for relational data model:
- provide better support for joins, many-to-one and many-to-many relationships
- simplicity of model
- high-security – you could provide access only to a specific set of tables, depending on their access level.
Query Languages for Data
There are multiple query languages for data:
- SQL is a declarative query language for relational model
- MapReduce querying
- “MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google”
- “MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages.”
- “The map and reduce functions are somewhat restricted in what they are allowed to do. They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. These restrictions allow the database to run the functions anywhere, in any order, and rerun them on failure. However, they are nevertheless powerful: they can parse strings, call library functions, perform calculations, and more.”
Graph-Like Data Models
A graph consists of two kinds of objects
- vertices (also known as nodes or entities)
- edges (also known as relationships or arcs)
There are 3 main types of data modeled as a graph
- Social graphs – Vertices are people, and edges indicate which people know each other.
- The web graph – Vertices are web pages, and edges indicate HTML links to other pages.
- Road or rail networks – Vertices are junctions, and edges represent the roads or railway lines between them.
There are several different, but related, ways of structuring and querying data in graphs:
- the property graph model (implemented by Neo4j, Titan, and InfiniteGraph)
- the triple-store model (implemented by Datomic, AllegroGraph, and others).
In the property graph model, each vertex consists of:
- A unique identifier
- A set of outgoing edges
- A set of incoming edges
- A collection of properties (key-value pairs)
Each edge consists of:
- A unique identifier
- The vertex at which the edge starts (the tail vertex)
- The vertex at which the edge ends (the head vertex)
- A label to describe the kind of relationship between the two vertices
- A collection of properties (key-value pairs)
Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures
The Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database. (It is named after a character in the movie The Matrix and is not related to ciphers in cryptography)
As is typical for a declarative query language, you don’t need to specify such execution details when writing the query: the query optimizer automatically chooses the strategy that is predicted to be the most efficient, so you can get on with writing the rest of your application.
Triple-Stores and SPARQL
The triple-store model is mostly equivalent to the property graph model and all info is stored in the form of three-part statements: (subject, predicate, object).
The Resource Description Framework (RDF) is a standard model for data publishing and interchange on the Web developed by the W3C. RDF is the standard used in a semantic graph database, also referred to as an RDF triplestore.
SPARQL, pronounced ‘sparkle’, is the standardized query language for RDF, the same way SQL is the standardized query language for relational databases.
When choosing a database, it is important to consider data size, structure, relationships, and how important it is to enforce schemas and ensure consistency.
The key points for each kind of database:
- many-to-many relationships
- data and data relationships need to strictly follow schema
- consistent transactions are important
- hard to scale because relationships are hard to partition effectively
- many-to-many relationships (graph structure)
- fast at following graph edges
- suited to complex network analytics
- less mature technology than Relational
- isolated documents
- retrieve by a key
- documents with different schemas that are easy to update
- easy to scale
Key-value store / object store
- opaque values
- no schema or relationships known to the database
- very simple operations
- easy to scale
- groups related columns for storage (easy to scale)
- memory effective for sparse data
Search engine database
- large amounts of unstructured data
- full text search or fuzzy search service
Time series database
- data is ordered by time
- many data streams
- real time entry ordering functionality
If you want to learn more about system design check the following resources
- A curated list of awesome System Designing articles, books and videos.
- An article about Systems Design for advanced beginners
- More details about SQL vs NoSQL, and also check this talk: “Introduction to NoSQL • Martin Fowler • GOTO 2012”
- An illustrated short article about how do you decide which type of database to use (video version), how is a SQL statement executed in the database, plus the Secret Sauce Behind NoSQL: LSM Tree
- Relational Databases Explained – How Relational Databases Work. This post talks about how indexes and transactions work on the inside of relational databases.