4 Comments

Awaiting the Shopify engineering blog post where the cache causes a major outage, because that's what Redis and Memcached nearly always end up doing when they're used to paper over a design that doesn't scale.


@Levi I agree that the cache is now a point of failure. However, I'd argue that even if they added some standalone service to orchestrate queries and caching, that service would still be a point of failure... though perhaps one with more levers to pull to protect it.

I do wonder, though, whether they have a fallback that lets worker machines go straight to the DB during a cache outage. If so, an outage becomes just a latency issue.

Another failure scenario that could occur is cache corruption.

What alternatives would you have looked at instead of slapping a cache in front of those services?


The bigger threat is described by Marc Brooker: https://brooker.co.za/blog/2021/08/27/caches.html

If a cache lets a system handle more load than it could handle without it, then a cache failure can easily push the system into a failed state it can't gracefully recover from (and falling back to hitting the DB directly during a cache outage is exactly what drives it into that state).
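To put rough numbers on that amplification (the figures below are assumptions for illustration, not anything from the post or from Brooker's article):

```python
# Entirely illustrative numbers: the DB gets sized for miss traffic,
# then a cache outage suddenly sends it everything.
hit_rate = 0.95           # assumed steady-state cache hit rate
request_rate = 20_000     # assumed requests per second across the fleet

db_load_with_cache = (1 - hit_rate) * request_rate    # ~1,000 req/s
db_load_without_cache = request_rate                   # 20,000 req/s

print(f"amplification: {db_load_without_cache / db_load_with_cache:.0f}x")  # 20x
```

At a 95% hit rate the DB has only ever had to absorb 5% of the traffic, so "fall back to the DB" is really a 20x step change in load arriving at the exact moment the system is already unhealthy.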

I'd strongly consider what Google (see Adya 2019) calls a local in-memory key-value store (in contrast to a remote in-memory key-value store like Memcached/Redis), where the service itself is the cache and the DB exists only to make that state durable. You get better sunny-day latency than with a remote KV and you remove the cache-coherency problem. The service is now stateful (it's no longer offloading all responsibility for state to the cache and the DB), which lets it coalesce requests (e.g. those produced by mobile app users repeatedly pulling to refresh) to reduce DB load; see the sketch below.
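A minimal sketch of that shape, assuming an in-process cache with request coalescing; the class, the TTL, and the fetch_from_db loader are placeholders I'm making up for illustration, not anything from the Adya paper or from Shopify's system:

```python
import threading
import time


class LocalCache:
    """In-process key-value cache with request coalescing.

    When many callers miss on the same key at once, only one of them goes to
    the backing store; the rest wait for that result. fetch_from_db is a
    stand-in for whatever durable store sits behind the service.
    """

    def __init__(self, fetch_from_db, ttl_seconds=30.0):
        self._fetch = fetch_from_db          # hypothetical loader: key -> value
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._values = {}                    # key -> (value, expires_at)
        self._inflight = {}                  # key -> Event for coalesced misses

    def get(self, key):
        while True:
            with self._lock:
                hit = self._values.get(key)
                if hit is not None and hit[1] > time.monotonic():
                    return hit[0]            # sunny-day path: never leaves the process
                waiter = self._inflight.get(key)
                if waiter is None:
                    # We are the one request that goes to the DB for this key.
                    self._inflight[key] = threading.Event()
                    break
            # Someone else is already fetching this key; wait, then re-check.
            waiter.wait()

        try:
            value = self._fetch(key)
            with self._lock:
                self._values[key] = (value, time.monotonic() + self._ttl)
            return value
        finally:
            with self._lock:
                self._inflight.pop(key).set()


# Example: 50 concurrent "pull to refresh" requests result in a single DB read.
if __name__ == "__main__":
    db_reads = []

    def fetch_from_db(key):
        db_reads.append(key)
        time.sleep(0.1)                      # pretend this is a slow query
        return f"value-for-{key}"

    cache = LocalCache(fetch_from_db)
    threads = [threading.Thread(target=cache.get, args=("feed:42",)) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"DB reads: {len(db_reads)}")      # expect 1, not 50
```

Because the cache lives in the service process, a hit never crosses the network, and concurrent misses for the same key collapse into one DB read. The trade-off is that each instance holds its own state, so you accept some per-instance staleness (bounded here by the TTL) instead of running cross-fleet coherency machinery.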


Thanks for sharing the article.
