The Pragmatic Engineer

Resiliency in Distributed Systems

newsletter.pragmaticengineer.com

Two chapters from the book Understanding Distributed Systems by Roberto Vitillo

Gergely Orosz
Sep 28, 2022

Understanding the ins and outs of distributed systems is important for backend engineers and for anyone working with large-scale systems. Large-scale can mean systems that handle high load and high queries per second (QPS), store a large amount of data, or are built for low latency and high reliability. These systems are common across both Big Tech and high-growth startups.

One of the most interesting books I’ve found on this topic is Understanding Distributed Systems. The book was written by Roberto Vitillo, who was a Senior Staff Engineer at Mozilla, then a Principal Engineer at Microsoft. The second edition was released in February 2022.

The book is structured in five parts:

  1. Communication. Reliable links, secure links, discovery, APIs.

  2. Coordination. System models, failure detection, time, leader election, replication, coordination avoidance, transactions.

  3. Scalability. HTTP caching, content delivery networks, partitioning, file storage, data storage, caching, microservices, control planes and data planes, messaging.

  4. Resiliency. Common failure causes, redundancy, fault isolation, downstream resiliency, upstream resiliency.

  5. Maintainability. Testing, continuous delivery and deployment, monitoring, observability, and manageability.

I like how the book works its way from the theory needed to understand distributed systems - communication and coordination - to practical topics like scalability and resiliency. The book closes with maintainability, an area that I have found gets surprisingly little attention in most books.

I reached out to Roberto to ask if he’d be open to sharing a few chapters of the book with newsletter readers, and he agreed. I chose two chapters on resiliency, from Part 4. If you’d like to dive deeper into other topics, you can buy the e-book on Roberto’s website, or the print book on Amazon.

In this excerpt, we cover:

1. Downstream resiliency

  • Timeout

  • Retry: exponential backoff, retry amplification

  • Circuit breaker

2. Upstream resiliency

  • Load shedding

  • Load leveling

  • Rate limiting: single process and distributed implementations

  • Constant work

Note that - as always - no links in this newsletter are affiliates and I have not been paid to endorse or recommend this book. More in my ethics statement.
