Resiliency in Distributed Systems
Two chapters from the book Understanding Distributed Systems by Roberto Vitillo
Understanding the ins and outs of distributed systems is important both for backend engineers and for anyone working with large-scale systems. Large-scale systems can mean ones handling high load and high queries per second (QPS), storing large amounts of data, or built for low latency and high reliability. Such systems are pretty common across both Big Tech and high-growth startups.
One of the most interesting books I’ve found on this topic is Understanding Distributed Systems. The book was written by Roberto Vitillo, who was a Senior Staff Engineer at Mozilla and then a Principal Engineer at Microsoft. The second edition of this book was released in February of this year.
The book is structured in five parts:
Communication. Reliable links, secure links, discovery, APIs.
Coordination. System models, failure detection, time, leader election, replication, coordination avoidance, transactions.
Scalability. HTTP caching, content delivery networks, partitioning, file storage, data storage, caching, microservices, control planes and data planes, messaging.
Resiliency. Common failure causes, redundancy, fault isolation, downstream resiliency, upstream resiliency.
Maintainability. Testing, continuous delivery and deployment, monitoring, observability, and manageability.
I like how the book works its way from the theory needed to understand distributed systems - communication and coordination - to practical topics like scalability and resiliency. The book closes with maintainability, an area I found gets surprisingly little focus in most books.
I reached out to Roberto asking if he’d be open to sharing a few chapters of the book with newsletter readers, and Roberto agreed to do so. I chose two chapters on resiliency, from Part 4. If you’d like to dive deeper into other topics, you can buy the e-book on Roberto’s website, or the print book off Amazon.
In this excerpt, we cover:
1. Downstream resiliency
Timeout
Retry: exponential backoff, retry amplification
Circuit breaker
2. Upstream resiliency
Load shedding
Load leveling
Rate limiting: single process and distributed implementations
Constant work
Note that - as always - no links in this newsletter are affiliates and I have not been paid to endorse or recommend this book. More in my ethics statement.
1. Downstream resiliency (Chapter 27)
Now that we have discussed how to reduce the impact of faults at the architectural level with redundancy and partitioning, we will dive into tactical resiliency patterns that stop faults from propagating from one component or service to another. In this chapter, we will discuss patterns that protect a service from failures of downstream dependencies.
Timeout
When a network call is made, it’s best practice to configure a timeout to fail the call if no response is received within a certain amount of time. If the call is made without a timeout, there is a chance it will never return, and as mentioned in chapter 24, network calls that don’t return lead to resource leaks. Thus, the role of timeouts is to detect connectivity faults and stop them from cascading from one component to another. In general, timeouts are a must-have for operations that can potentially never return, like acquiring a mutex.
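As a quick illustration - this sketch isn’t from the book - here’s what an explicit deadline on a network call looks like with the fetch API, available in modern browsers and recent Node.js versions. The URL and the 5-second budget are placeholder values:

```javascript
// Fail the call if no response arrives within 5 seconds.
// AbortSignal.timeout() returns a signal that aborts after the given delay.
async function getUser(id) {
  try {
    const response = await fetch(`https://api.example.com/users/${id}`, {
      signal: AbortSignal.timeout(5000),
    });
    return await response.json();
  } catch (err) {
    // A TimeoutError means the deadline expired before a response arrived;
    // the caller can now retry, fall back, or surface the failure.
    if (err.name === "TimeoutError") {
      throw new Error(`request for user ${id} timed out`);
    }
    throw err;
  }
}
```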
Unfortunately, some network APIs don’t have a way to set a timeout in the first place, while others have no timeout configured by default. For example, JavaScript’s XMLHttpRequest is the web API to retrieve data from a server asynchronously, and its default timeout is zero, which means there is no timeout.
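Here’s a minimal sketch of the pitfall and the fix; the endpoint and the 10-second budget are placeholder values:

```javascript
const xhr = new XMLHttpRequest();
xhr.open("GET", "https://api.example.com/users/42", true);

// The default timeout is 0, meaning the request can hang forever.
// Setting an explicit deadline (in milliseconds) prevents that.
xhr.timeout = 10000;

xhr.onload = () => {
  // A response arrived within the deadline.
  console.log(xhr.status, xhr.responseText);
};

xhr.ontimeout = () => {
  // The deadline expired; treat the call as failed rather than waiting forever.
  console.error("request timed out");
};

xhr.send();
```

Without the explicit timeout property, a hung connection would keep the request - and whatever resources the caller holds while waiting on it - alive indefinitely.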