Resiliency in Distributed Systems
Two chapters from the book Understanding Distributed Systems by Roberto Vitillo
Understanding the ins and outs of distributed systems is important for backend engineers and for anyone else working with large-scale systems. Large-scale systems can mean systems with high load and high queries per second (QPS), systems storing a large amount of data, or ones built for low latency and high reliability. These systems are pretty common across both Big Tech and high-growth startups.
One of the most interesting books I’ve found on this topic is Understanding Distributed Systems. The book was written by Roberto Vitillo, who was a Senior Staff Engineer at Mozilla, then a Principal Engineer at Microsoft. The second edition of this book was released in February of this year.
The book is structured in five parts:
Communication. Reliable links, secure links, discovery, APIs.
Coordination. System models, failure detection, time, leader election, replication, coordination avoidance, transactions.
Scalability. HTTP caching, content delivery networks, partitioning, file storage, data storage, caching, microservices, control planes and data planes, messaging.
Resiliency. Common failure causes, redundancy, fault isolation, downstream resiliency, upstream resiliency.
Maintainability. Testing, continuous delivery and deployment, monitoring, observability, and manageability.
I like how the book works its way from the theory needed to understand distributed systems - communication and coordination - to practical topics like scalability and resiliency. The book closes with topics on maintainability, an area that I've found gets surprisingly little attention in most books.
I reached out to Roberto asking if he’d be open to sharing a few chapters of the book with newsletter readers, and Roberto agreed to do so. I chose two chapters on resiliency, from Part 4. If you’d like to dive deeper into other topics, you can buy the e-book on Roberto’s website, or the print book off Amazon.
In this excerpt, we cover:
1. Downstream resiliency
Timeout
Retry: exponential backoff, retry amplification
Circuit breaker
2. Upstream resiliency
Load shedding
Load leveling
Rate limiting: single process and distributed implementations
Constant work
Note that - as always - no links in this newsletter are affiliates and I have not been paid to endorse or recommend this book. More in my ethics statement.