The observability provider was down for more than a day in March. What went wrong, how did the engineering team respond, and what can businesses learn from the incident? Exclusive.
Thanks for the detailed explanation. I really like how the changelog of the Linux update is analyzed, with descriptions of OS-specific terms and utilities. Thanks for sharing this. This should help many organizations improve their infra setup!
Thanks for the details, Gergely! Wonder if the name of the “legacy security update channel” is “unattended-upgrades”. I remember mitigating a similar incident (at a much smaller company) caused by this seemingly innocent tool.
A couple of engineers from Datadog had a great talk recently (https://www.usenix.org/conference/srecon23americas/presentation/malla) where "interesting" network handling by Cilium entered into the problem too.
I don't read all issues, but I want to tell you how much I've liked this one. Thank you!