The Pulse: AWS takes down a good part of the internet
On Monday, a major AWS outage hit thousands of sites & apps, and even a Premier League soccer game. An overview of what caused this high-profile, global outage, and learnings from the incident
Monday was an interesting day: Signal stopped working, Slack and Zoom had issues, and most Amazon services were also down. The cause was a 14-hour AWS outage in the us-east-1 region. In this issue of The Pulse, we take a deeper look at what happened and why, covering:
Worldwide impact. From Ring cameras, Robinhood, Snapchat, and Duolingo, all the way to Substack – sites and services went down in their thousands.
Unexpected AWS dependencies. Status pages using Atlassian’s Statuspage product could not be updated, Eight Sleep mattresses were effectively bricked for users, Postman was unusable, UK taxpayers couldn’t access the HMRC portal, and a Premier League game was interrupted.
What caused the outage? A DNS race condition at DynamoDB (sketched in code after this list), Amazon EC2 falling into congestive collapse, and network propagation issues together stretched the outage to 14 hours.
Why such dependency on us-east-1? It feels like half of the internet is on us-east-1 for its low pricing and high capacity. Meanwhile, some AWS services are themselves dependent on this region.
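The DNS race condition is worth pausing on, because the failure mode is easy to reproduce in miniature. Below is a minimal, hypothetical Python sketch of how a last-writer-wins bug in DNS automation can let a delayed worker overwrite a fresh record with a stale one; every name in it is invented for illustration, and none of it is AWS's actual code.

```python
from dataclasses import dataclass

@dataclass
class DnsPlan:
    generation: int   # monotonically increasing plan version
    ips: list[str]    # endpoint IPs this plan resolves to

# The shared DNS record that automation workers write to.
record: dict[str, list[str]] = {}

def apply_plan_unsafely(plan: DnsPlan) -> None:
    # BUG: no check that `plan` is newer than what is already applied,
    # so a delayed worker holding a stale plan silently wins.
    record["dynamodb.example.internal"] = plan.ips

def apply_plan_safely(applied_generation: int, plan: DnsPlan) -> int:
    # One fix: compare-and-set on the generation number, rejecting
    # any plan that is not strictly newer than the applied one.
    if plan.generation <= applied_generation:
        return applied_generation  # stale plan: ignore it
    record["dynamodb.example.internal"] = plan.ips
    return plan.generation

fresh = DnsPlan(generation=2, ips=["10.0.0.2"])
stale = DnsPlan(generation=1, ips=["10.0.0.1"])

apply_plan_unsafely(fresh)
apply_plan_unsafely(stale)  # the delayed worker fires second and "wins"
print(record)  # {'dynamodb.example.internal': ['10.0.0.1']} (stale)

record.clear()
gen = 0
gen = apply_plan_safely(gen, fresh)
gen = apply_plan_safely(gen, stale)  # rejected as stale
print(record)  # {'dynamodb.example.internal': ['10.0.0.2']} (fresh)

# If a cleanup job then deletes old-generation state because it believes
# a newer plan is live, an unguarded record can end up empty: a hard outage.
```

The guarded version is the standard remedy: make every write conditional on the plan being strictly newer, so stale automation is rejected rather than applied.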
There’s a joke that when AWS us-east-1 sneezes, the world catches a cold. It’s the most popular region to deploy services in, and for many companies it’s the default region of choice; the sketch below shows one way that default quietly takes hold. us-east-1 is also the largest and oldest of AWS’s regions, and AWS itself has several dependencies located there. A us-east-1 region-wide outage therefore usually turns into AWS-wide issues, which can affect many websites, apps, and infrastructure companies.
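As one small illustration of how workloads end up there by default: S3’s CreateBucket API places a bucket in us-east-1 whenever no LocationConstraint is supplied. The boto3 snippet below shows this; the bucket names are hypothetical, and actually running it requires valid AWS credentials.

```python
import boto3

# Omitting a LocationConstraint places the bucket in us-east-1,
# S3's default region for bucket creation.
s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="my-example-bucket")

# Putting it anywhere else requires saying so explicitly:
s3_eu = boto3.client("s3", region_name="eu-west-1")
s3_eu.create_bucket(
    Bucket="my-example-bucket-eu",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)
```

Multiply that pattern across years of tutorials, templates, and copy-pasted configs, and "half of the internet is on us-east-1" stops sounding like an exaggeration.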
1. Worldwide impact
On Monday, 20 October, an outage in this region began at around 3am EDT (9am CEST) and lasted a total of 14 hours. Sites and apps impacted:

