The observability provider was down for more than a day in March. What went wrong, how did the engineering team respond, and what can businesses learn from the incident? Exclusive.
Thanks for the detailed explanation. I really like how the changelog of the Linux update is analyzed, with descriptions of OS-specific terms and utilities. Thanks for sharing this. This should help many organizations improve their infra setup!
A couple of engineers from Datadog gave a great talk recently (https://www.usenix.org/conference/srecon23americas/presentation/malla) where "interesting" network handling by Cilium played into the problem too.
I don't read all issues, but I want to tell you how much I've liked this one. Thank you!
Thanks for the details, Gergely! I wonder if the name of the “legacy security update channel” is “unattended-upgrades”. I remember mitigating a similar incident (at a much smaller company) caused by this seemingly innocent tool.