Three Cloud Providers, Three Outages: Three Different Responses
This year, AWS, Azure, and Google Cloud have all suffered comparable regional outages. How did they respond, and why do Azure’s processes stand out compared to its rivals?
It’s rare that all three major cloud providers suffer regional outages, but that’s exactly what happened between April and July:
25 April 2023: GCP. A Google Cloud region (europe-west9) went offline for about a day, and a zone (europe-west9-a) was offline for two weeks. (incident details). We did a deep dive into this incident in What is going on at Google Cloud?
13 June 2023: AWS. The largest AWS region (us-east-1) degraded heavily for 3 hours, impacting 104 AWS services. The joke goes that when us-east-1 sneezes, the whole world feels it, and this proved true: Fortnite matchmaking stopped working, McDonald’s and Burger King food orders via apps couldn’t be made, and customers of services like Slack, Vercel, Zapier and many more all felt the impact. (incident details). We did a deep dive into this incident earlier in AWS’s us-east-1 outage.
5 July 2023: Azure. A region (West Europe) partially went down for about 8 hours due to a major storm in the Netherlands. Customers of Confluent, CloudAmp, and several other vendors running services out of this region suffered disruption. (incident details). We touched on this outage in The Scoop #55: how can a storm damage fiber cables?
A regional outage is rare for any cloud provider because regions are built to be resilient. The fact that each major cloud provider suffered one allows us to compare their responses and draw some lessons about best practices.
Today, we cover:
What is a cloud region and how does it differ between cloud providers? A recap.
Communicating during the incident
Preliminary incident details
Incident postmortem and retrospective
Why is AWS so opaque to the public?
Why is Azure stepping up in transparency and accountability?
Lessons for engineering teams from the three cloud providers
As a fun fact: AWS published its first public postmortem in two years, possibly thanks to this article being written and my communication with them, in which I reminded them that not publishing a postmortem for such a major event seems to contradict their own commitment to post-incident reviews.
1. What is a cloud region?
Before we dive in, a quick refresher on what a cloud region is. Definitions vary by cloud provider:
AWS has the strictest definition of the three providers, and defines a region in the most resilient sense:
“AWS has the concept of a Region, which is a physical location around the world where we cluster data centers. We call each group of logical data centers an Availability Zone (AZ).
Each AWS Region consists of a minimum of three isolated and physically separate AZs within a geographic area. Unlike other cloud providers, who often define a region as a single data center, the multiple AZ design of every AWS Region offers advantages for customers. Each AZ has independent power, cooling, and physical security and is connected via redundant, ultra-low-latency networks. AWS customers focused on high availability can design their applications to run in multiple AZs to achieve even greater fault-tolerance. AWS infrastructure Regions meet the highest levels of security, compliance, and data protection.”
Azure defines regions more vaguely:
“An Azure region is a set of datacenters, deployed within a latency-defined perimeter and connected through a dedicated regional low-latency network.”
Azure claims to have more regions than any other cloud provider. However, unlike AWS, Azure does not state that two data centers are in physically separate locations, meaning that a region could be run out of a couple of buildings on the same site.

Google Cloud has the loosest definitions of a region and a zone, where two zones could run from a single physical data center and be separated only logically:
“Regions are independent geographic areas that consist of zones.
Zones and regions are logical abstractions of underlying physical resources provided in one or more physical data centers. These data centers may be owned by Google and listed on the Google Cloud locations page, or they may be leased from third-party data center providers.
A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.”

As we will see, this logical separation of two zones that run in the same physical location was the arrangement when a fire at one data center knocked a whole region offline. In that case, deploying applications into separate zones within the same region would still have resulted in a failure.
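To make these differences concrete, here is a minimal sketch, assuming boto3 and configured AWS credentials, of how you could check how many availability zones a given AWS region actually exposes. The region names are only examples, and note the limits of such a check: the API tells you how many zones a region advertises, not whether those zones share a building, which is exactly the detail that mattered in the Google Cloud incident.

```python
# Minimal sketch (not from any provider's docs): count the availability
# zones a region exposes. Assumes boto3 is installed and AWS credentials
# are configured; the region names below are just examples.
import boto3

def zones_in_region(region_name: str) -> list[str]:
    """Return the names of the available AZs in a region."""
    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )
    return [az["ZoneName"] for az in response["AvailabilityZones"]]

if __name__ == "__main__":
    for region in ("us-east-1", "eu-west-3"):
        zones = zones_in_region(region)
        # Per AWS's definition, every region should report at least three AZs.
        print(f"{region}: {len(zones)} AZs -> {', '.join(zones)}")
```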
2. Communicating during the incident
How did the cloud providers communicate during the incidents? A summary:

Google Cloud is the only cloud provider that preserved the incident communication log on the incident page, so we can see which updates occurred and when. I like that every status update followed this template:
“Summary: {short summary}
Description: {more details, based on what is known}
{When the next update can be expected}
Diagnosis: {summary if this is known}
Workaround: {possible workaround for customers}”
The incident at Google Cloud was a particularly nasty one; flames spread through a data center hosting the europe-west9-a zone, and also some clusters from the europe-west9-c zone. The incident caused the entire europe-west9 region to be inaccessible for around 14 hours. There was little to tell customers beyond that they needed to fail over to other regions. Google Cloud posted regular updates sharing what they could, such as this one, four hours into the incident:
“Summary: We are investigating an issue affecting multiple Cloud services in the europe-west9-a zone
Description: Water intrusion in europe-west9-a led to an emergency shutdown of some hardware in that zone. There is no current ETA for recovery of operations in europe-west9-a, but it is expected to be an extended outage. Customers are advised to fail over to other zones if they are impacted.
We will provide an update by Wednesday, 2023-04-26 00:30 US/Pacific with current details.
We apologize to all who are affected by the disruption.
Diagnosis: Customers may be unable to access Cloud resources in europe-west9-a
Workaround: Customers can fail over to other zones.”
After the first few hours, updates became pretty much copy-paste and uninformative. However, GCP kept providing them, and was clear about when the next update was due.
While I appreciate continuous updates, these were unnecessarily verbose and robotic in tone, as if someone were resending the same template every 30 minutes. It was also hard to tell when an update contained new information. Still, it’s better to send updates and keep them visible after an incident than not to send them, or to remove them once the incident is over.
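Google’s advised workaround was simply to “fail over to other zones.” As a rough illustration of what the very first step of that might look like, here is a minimal sketch, mine and not from Google’s incident notes, that lists which zones of a region Compute Engine currently reports as up, using the google-cloud-compute client library. The project ID is a placeholder, and the hard part, actually re-creating instances and re-pointing traffic in another zone, is not covered here.

```python
# Minimal sketch (not from Google's incident notes): list the zones of a
# region that Compute Engine reports as UP, as a first step before failing
# over. Assumes the google-cloud-compute library and a real project ID.
from google.cloud import compute_v1

def healthy_zones(project: str, region: str) -> list[str]:
    """Return the names of zones in `region` whose status is reported as UP."""
    client = compute_v1.ZonesClient()
    return [
        zone.name
        for zone in client.list(project=project)
        if zone.name.startswith(region) and zone.status == "UP"
    ]

if __name__ == "__main__":
    # "my-gcp-project" is a placeholder; during this outage, europe-west9-a
    # was the zone customers needed to move away from.
    print("Candidate failover zones:", healthy_zones("my-gcp-project", "europe-west9"))
```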
Of all the cloud providers, AWS did the best job of sharing concise, clear, and frequent enough updates. Here are the updates from the first hour of the incident:
“[13 June 2023] 12:08 PM PDT We are investigating increased error rates and latencies in the US-EAST-1 Region.
12:19 PM PDT AWS Lambda function invocation is experiencing elevated error rates. We are working to identify the root cause of this issue.
12:26 PM PDT We have identified the root cause of the elevated errors invoking AWS Lambda functions, and are actively working to resolve this issue.
12:36 PM PDT We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates.
1:14 PM PDT We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors.
1:38 PM PDT We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery.”
Around two hours into the incident, an update provided a summary of progress:
“1:48 PM PDT Beginning at 11:49 AM PDT, customers began experiencing errors and latencies with multiple AWS services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use by other AWS services. We have associated other services that are impacted by this issue to this post on the Health Dashboard.
Additionally, customers may experience authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS. Customers may also experience intermittent issues when attempting to call or initiate a chat to AWS Support.
We are now observing sustained recovery of the Lambda invoke error rates, and recovery of other affected AWS services. We are continuing to monitor closely as we work towards full recovery across all services.”
And the updates from the final hour and a half or so of the outage:
“2:00 PM PDT Many AWS services are now fully recovered and marked Resolved on this event. We are continuing to work to fully recover all services.
2:29 PM PDT Lambda synchronous invocation APIs have recovered. We are still working on processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services (such as SQS and EventBridge). Lambda is working to process these messages during the next few hours and during this time, we expect to see continued delays in the execution of asynchronous invocations.
2:49 PM PDT We are working to accelerate the rate at which Lambda asynchronous invocations are processed, and now estimate that the queue will be fully processed over the next hour. We expect that all queued invocations will be executed.”
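That backlog behaviour is worth noting as a customer: asynchronous Lambda invocations that cannot run immediately are queued and retried, which is why executions kept trickling in for hours. As a hedged sketch, mine rather than AWS guidance for this incident, boto3’s put_function_event_invoke_config call lets you cap how long queued events may wait and send expired or failed events to a destination such as an SQS queue, so a backlog surfaces explicitly instead of as silent delays. The function name and queue ARN below are placeholders.

```python
# Minimal sketch (not AWS's recommendation for this incident): bound how long
# asynchronous Lambda invocations may wait in the internal queue, and capture
# dropped events. The function name and SQS ARN are placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

lambda_client.put_function_event_invoke_config(
    FunctionName="my-order-processor",   # placeholder function name
    MaximumEventAgeInSeconds=900,        # discard events that wait longer than 15 minutes
    MaximumRetryAttempts=1,              # retry once instead of the default two times
    DestinationConfig={
        # Failed or expired events land here, visible and replayable,
        # rather than being silently delayed or lost.
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:lambda-failed-events"
        }
    },
)
```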
Azure is the only cloud provider whose in-incident updates cannot be viewed after the incident was resolved. I tracked this incident at the time and there were regular updates; however, today there is no paper trail showing their contents.
There is plenty to like about how Azure handles public communications, but it is the only major cloud provider that makes incident notes inaccessible to public view after an outage is resolved.
3. Preliminary incident details
With any incident, mitigation comes first. The investigation starts once everything is back to normal and customers can use the service again. This investigation can be time-consuming, so it’s hard to tell customers exactly when to expect more details. However, in these cases we aren’t talking about a small engineering team with a few users, but about the largest cloud providers in the world, on which tens of thousands of businesses rely. These businesses expect full details once the investigation is complete. Here’s how the providers compared:
AWS was extremely fast in providing a summary of the incident, doing so just 5 minutes (!!) after the incident was mitigated. Below is the update added to the status page:
“3:42 PM PDT Between 11:49 AM PDT and 3:37 PM PDT, we experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use of other AWS services. Additionally, customers may have experienced authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS.
Customers may also have experienced issues when attempting to initiate a Call or Chat to AWS Support. As of 2:47 PM PDT, the issue initiating calls and chats to AWS Support was resolved.
By 1:41 PM PDT, the underlying issue with the subsystem responsible for AWS Lambda was resolved. At that time, we began processing the backlog of asynchronous Lambda invocations that accumulated during the event, including invocations from other AWS services. As of 3:37 PM PDT, the backlog was fully processed. The issue has been resolved and all AWS Services are operating normally.”
Call me amazed. Credit where it is due, this communication was fast and on point. Unfortunately, it was also almost the last time AWS communicated anything publicly about this outage.
Google Cloud posted a preliminary incident review 14 days after the regional outage was resolved, at a time when the europe-west9-a zone was still down. The fact that the zonal outage lasted 14 days makes it a little tricky to judge whether Google Cloud posted the preliminary report one day after the full outage was resolved, or 14 days after the regional outage was mitigated.
I am leaving the 14 days here, because a regional outage is far wider-reaching than a zonal one, and Google did leave customers who depended on the europe-west9 region, but not on the europe-west9-a zone, waiting two weeks for preliminary details.
Azure is the only cloud provider that publishes timelines for preliminary PIRs (post-incident reviews), and for full incident reviews:
“We endeavor to publish a ‘Preliminary’ PIR within 3 days of incident mitigation, to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a ‘Final’ PIR with additional details/learnings.”
In this case, I remember Azure did publish such a review, and it was within the 3-day timeframe. However, after the final review was published the preliminary review was no longer publicly available.
4. Incident postmortem and retrospective
Up to this point, the three cloud providers did comparably well at handling their incidents. Post-incident is where things diverged significantly: