Designing systems or services is about combining functional requirements (what the system should do) with non-functional requirements (quality attributes, or “ilities”). The challenge is to develop software with the right quality levels. This is where a software architecture methodology comes in: it helps us build a bridge between the problem space and the solution space.
My role as an Engineering Manager is also about technical leadership and helping my teams make the right technical decisions based on data, best practices and current expectations from our clients. And since I have just finished reading “Building a Second Brain”, I am sharing here my notes on the book “Designing Data-Intensive Applications” by Martin Kleppmann, a must-read in the system design field.
Designing data systems or services is about finding answers to a set of questions like:
- How do you ensure that the data remains correct and complete, even when things go wrong internally?
- How do you provide consistently good performance to clients, even when parts of your system are degraded?
- How do you scale to handle an increase in load?
- What does a good API for the service look like?
From a long list of qualities, this article covers three: reliability, scalability and maintainability.
Reliability means that the system continues to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
The checklist of a reliable system:
- it offers the function that the user expected
- it tolerates the user making mistakes or using it in unexpected ways
- it performs well enough under the expected load and data volume
- it is secure (prevents any unauthorised access)
Reliability = “continuing to work correctly, even when things go wrong”
Fault vs Failure
Fault = one component of the system deviating from its spec; faults can be hardware faults (typically random and uncorrelated), software faults (bugs), or human errors.
Failure = the system as a whole stops providing the required service to the user
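The distinction can be illustrated with a minimal sketch, assuming a hypothetical service backed by several replicas. The names `read_with_failover`, `faulty_replica` and `healthy_replica` are invented for this example and do not come from the book. A fault in one replica is tolerated as long as another replica can answer; only when every replica is faulty does the caller see a failure.

```python
from typing import Callable, Sequence


class ServiceFailure(Exception):
    """Raised when the system as a whole cannot serve the request."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in turn; a single faulty replica is tolerated."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except Exception as exc:  # a fault in one component
            errors.append(exc)
    # Every replica deviated from its spec: the faults became a failure.
    raise ServiceFailure(f"all {len(replicas)} replicas failed: {errors}")


def faulty_replica(key: str) -> str:
    """Hypothetical replica that always hits a hardware fault."""
    raise ConnectionError("disk unavailable")


def healthy_replica(key: str) -> str:
    """Hypothetical replica that answers normally."""
    return f"value-for-{key}"


# One replica is faulty, but the user still gets an answer: no failure.
print(read_with_failover([faulty_replica, healthy_replica], "user:42"))
```

Real systems usually add timeouts, retries with backoff and health checks on top of this idea; the sketch only shows why a fault does not have to become a failure.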
How to make a system reliable:
- minimise the risk for error ⇒ well-designed abstractions
- have separate environments ⇒ dev, stage, pre-prod, prod
- testing, testing, testing ⇒ unit tests, integration tests, performance tests, end-to-end tests
- deployment pipelines ⇒ rollback procedure, canary releases, gradual rollouts
- monitoring and alerting ⇒ telemetry
- practices and process ⇒ incident management process, operational readiness checklists, coding guidelines, branching strategies
Scalability means that as the system grows, there should be reasonable ways of dealing with that growth (load).
Addressing scalability is about having answers to the following questions:
- If the system grows in a particular way, what are our options for coping with the growth?
- How can we add computing resources to handle the additional load?
But before answering these questions, the load parameters should be identified. They could be requests per second, the read-to-write ratio, the number of active users, or the hit rate on a cache.
After the load parameters are in place, the next step is to describe the performance: when a load parameter is increased, how much do we need to increase the resources so we can keep performance unchanged?
Latency vs Response Time
“Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.” (Martin Kleppmann: Designing Data-Intensive Applications)
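To make the quoted distinction concrete, here is a small sketch that splits one request into the three components mentioned above. The class `RequestTiming` and the numbers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    network_ms: float   # time spent on the network, both directions
    queueing_ms: float  # time the request waits before being handled
    service_ms: float   # time spent actually processing the request

    @property
    def latency_ms(self) -> float:
        """Latency in the quoted sense: time spent waiting to be handled."""
        return self.queueing_ms

    @property
    def response_time_ms(self) -> float:
        """What the client sees: service time plus network and queueing delays."""
        return self.network_ms + self.queueing_ms + self.service_ms


timing = RequestTiming(network_ms=20.0, queueing_ms=35.0, service_ms=45.0)
print(f"latency: {timing.latency_ms} ms, response time: {timing.response_time_ms} ms")
```

In this made-up example the server spends only 45 ms processing, yet the client waits 100 ms, which is why response time is the number to watch from the user's point of view.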
SLA vs SLO vs SLI
An SLA (service level agreement) = the agreement you make with your clients and users
An SLO (service level objective) = the objectives your team must hit to meet that agreement
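An SLI (service level indicator) = the actual measured value of a service metric (for example, the observed response time or error rate), which is compared against the SLO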
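As a rough sketch of how an SLI relates to an SLO, the snippet below measures a response-time percentile (the SLI) and compares it with a target (the SLO). The helper names, the nearest-rank percentile method, the 200 ms threshold and the sample data are all assumptions made for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_slo(response_times_ms: Sequence[float], pct: int, threshold_ms: float) -> bool:
    """True if the measured percentile (the SLI) is within the objective (the SLO)."""
    return percentile(response_times_ms, pct) <= threshold_ms


# Hypothetical measurements: 197 fast requests and 3 slow outliers.
samples = [80.0] * 197 + [900.0] * 3

print("p95 SLI:", percentile(samples, 95), "meets 200 ms SLO:", meets_slo(samples, 95, 200.0))
print("p99 SLI:", percentile(samples, 99), "meets 200 ms SLO:", meets_slo(samples, 99, 200.0))
```

With this data the p95 objective is met while the p99 objective is not: a handful of outliers is invisible at one percentile and decisive at another, so the choice of percentile in an SLO matters.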
High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request.
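Tail latency amplification = when an end-user request requires multiple backend calls, a single slow call is enough to slow down the whole request, so even a small percentage of slow backend calls results in a much higher proportion of slow end-user requests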
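The effect can be estimated with a short calculation. The sketch below assumes that backend calls are independent and that the end-user request has to wait for all of them, which is a simplification; the function name is hypothetical. Under those assumptions, the probability that at least one of n calls is slow is 1 - (1 - p)^n.

```python
def slow_request_probability(p_slow_call: float, backend_calls: int) -> float:
    """Probability that at least one of the backend calls is slow.

    Assumes independent calls and a request that waits for every call.
    """
    if not 0.0 <= p_slow_call <= 1.0:
        raise ValueError("p_slow_call must be between 0 and 1")
    if backend_calls < 0:
        raise ValueError("backend_calls must be non-negative")
    return 1 - (1 - p_slow_call) ** backend_calls


for calls in (1, 10, 100):
    share = slow_request_probability(0.01, calls)
    print(f"{calls:>3} backend calls -> {share:.1%} of end-user requests are slow")
```

If only 1% of backend calls are slow, a request that fans out to 100 backend calls is slow roughly 63% of the time, which is why the high percentiles of a backend service deserve attention.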
How to scale a system:
- scaling up or vertical scaling ⇒ move to a more powerful machine
- scaling out or horizontal scaling ⇒ distribute the load across multiple machines
- elastic systems ⇒ if the load is highly unpredictable, the system can automatically add computing resources when it detects a load increase
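An elastic system needs some rule for deciding how many machines to run. The sketch below uses a simple proportional rule: scale the instance count by the ratio between observed and target load per instance, within fixed bounds. The function name, the parameters and the bounds are assumptions for illustration, not the API of any particular autoscaler.

```python
import math


def desired_instances(
    current_instances: int,
    observed_load_per_instance: float,
    target_load_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """Proportional scaling rule, clamped between min_instances and max_instances."""
    if target_load_per_instance <= 0:
        raise ValueError("target_load_per_instance must be positive")
    raw = math.ceil(current_instances * observed_load_per_instance / target_load_per_instance)
    return max(min_instances, min(max_instances, raw))


# Hypothetical load parameter: requests per second per instance, target 600.
print(desired_instances(4, 900, 600))    # load increase: scale out
print(desired_instances(4, 150, 600))    # load drop: scale in
print(desired_instances(10, 6000, 600))  # spike: capped at max_instances
```

The upper bound keeps a traffic spike (or a bug in the load metric) from provisioning unbounded resources, and the lower bound keeps the service available when load is close to zero.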
Maintainability means that the people working on the system should be able to work on it productively.
This quality is not easy to achieve, but it is not impossible either. It is about having a set of KPIs, automated processes to monitor those KPIs, and a quality mindset within your team (clean code guidelines, testing, automation, prioritising tech debt, etc.).