This year, AWS, Azure, and Google Cloud have all suffered comparable regional outages. How did they respond, and why do Azure’s processes stand out compared to its rivals?
If I remember correctly, the AWS console was intermittently reachable (not at all for some people). Making manual fail overs from Route53 extremely difficult, not to mention accessing the health dashboard.
Gergely, when you say that all providers promptly acknowledged the incident, how do you define promptly? Where I work, we often have the status page updated to reflect an incident about 15 minutes after it starts. This is the time it takes set up an incident call, understand the user impact, draft communications to customers, and have that draft approved. Some users / customers complain that this is too slow. I'm curious where the industry standard is at here. Would you say that this is a good response time or not?
@Brian: this is actually a good point, and one I kind of hand-waved! For the two cloud providers we have logs for: they took 20-30 minutes to acknowledge the issue from starting. AWS took 20 minutes, and GCP took 30 minutes. Let me update this.
If I remember correctly, the AWS console was intermittently reachable (not at all for some people). Making manual fail overs from Route53 extremely difficult, not to mention accessing the health dashboard.
Great article comparing the different providers in this way! I enjoyed the cheesy joke at the end, it was a nice personal touch.
Gergely, when you say that all providers promptly acknowledged the incident, how do you define promptly? Where I work, we often have the status page updated to reflect an incident about 15 minutes after it starts. This is the time it takes set up an incident call, understand the user impact, draft communications to customers, and have that draft approved. Some users / customers complain that this is too slow. I'm curious where the industry standard is at here. Would you say that this is a good response time or not?
@Brian: this is actually a good point, and one I kind of hand-waved! For the two cloud providers we have logs for: they took 20-30 minutes to acknowledge the issue from starting. AWS took 20 minutes, and GCP took 30 minutes. Let me update this.