Cybersecurity vendor CrowdStrike shipped a routine rule definition change to all customers, and chaos followed as 8.5M machines crashed, worldwide. There are plenty of learnings for developers.
When I joined Google, I was surprised that there are hundreds (or more) of SWEs working on systems for progressive rollouts of binaries, feature flags, and canary analysis. After working at that scale, it's very satisfying to see in practice how important this is: thousands of small changes target small partitions of traffic, and some are reverted automatically. It's unimaginable for a company of that type not to have this in place.
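To make "targeting a small partition of traffic" concrete, here is a minimal sketch - my own illustration, not Google's implementation - of deterministic percentage bucketing: each host or user ID hashes to a stable bucket, and only IDs in buckets below the rollout percentage see the new behavior.

```c
#include <stdint.h>

/* FNV-1a hash: maps an ID to a stable number, so the same host or user
 * always lands in the same bucket across evaluations. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if this ID falls inside the current rollout percentage,
 * e.g. in_rollout("host-1234", 1) gates a 1% rollout. */
static int in_rollout(const char *id, uint32_t rollout_percent) {
    return (fnv1a(id) % 100u) < rollout_percent;
}
```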
> The instruction that crashed is the Assembly instruction “mov r9d, [r8].” This instructs to move the bytes in the r9d address to the r8 one. The problem is that r8 is an unmapped address (invalid), and so the process crashes!
I think you have it backwards. I think `mov r9d, [r8]` is destination/source, not source/destination (as your text claims).
Hugh: you're right, thank you! I've both updated the text and re-read the MOV instruction's parameter definition.
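For anyone who doesn't read assembly: in Intel syntax the destination comes first, so `mov r9d, [r8]` loads 32 bits from the memory address held in r8 into the r9d register. A minimal C sketch of the same operation (illustrative names, not CrowdStrike's code):

```c
#include <stdint.h>

/* Rough C equivalent of `mov r9d, [r8]` (Intel syntax: destination, source).
 * r8 holds an address; the instruction loads the 32-bit value stored at
 * that address into r9d. */
static uint32_t load_like_mov(const uint32_t *r8) {
    uint32_t r9d = *r8;  /* faults here if r8 points at unmapped memory */
    return r9d;
}
```

In user space that fault kills one process; in a kernel driver, the same invalid read takes down the whole machine, which is why this surfaced as a blue screen.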
I'd be interested in an additional dimension to the advice above: handling the necessary speed of security updates. In security, if a vulnerability is found, closing it is a P0. For processes such as CrowdStrike's, slower rollouts are a harder sell than they are for binaries; time is far more of the essence.
As a specific point, canarying a denylist over 4 days for an active attack is an absolute no-go, very different from canarying a configuration of supported types. What do you do then? I was thinking you do a fast canary (0-1-100) with rigorous testing of the engine.
P.S. I'm very interested in the testing of the engine (not the validator), because theoretically this bug has been around for _months_.
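To make the 0-1-100 idea concrete, here is a minimal sketch of a fast staged rollout - my own illustration with made-up soak times, not CrowdStrike's process: push to an internal fleet first, then to a small customer slice, then everywhere, with a health gate before each widening.

```c
#include <stdio.h>

/* One rollout stage: what share of the fleet gets the update and how long
 * we watch health signals before widening to the next stage. */
struct stage {
    int percent;       /* share of the fleet receiving the update */
    int soak_minutes;  /* time to watch health before widening */
};

/* "0-1-100": internal machines, then 1% of customers, then everyone.
 * Soak times are illustrative; for an active attack they would be
 * minutes, not days. */
static const struct stage fast_canary[] = {
    {   0, 10 },  /* "0": internal/dogfood fleet only, no customers yet */
    {   1, 30 },  /* "1": 1% of the customer fleet */
    { 100,  0 },  /* "100": everyone */
};

/* Placeholder health gate: a real one would compare crash and heartbeat
 * telemetry for the cohort against a baseline before widening. */
static int cohort_healthy(int percent) {
    (void)percent;
    return 1;
}

int main(void) {
    size_t n = sizeof(fast_canary) / sizeof(fast_canary[0]);
    for (size_t i = 0; i < n; i++) {
        printf("rolling out to %d%%, soaking for %d min\n",
               fast_canary[i].percent, fast_canary[i].soak_minutes);
        if (!cohort_healthy(fast_canary[i].percent)) {
            printf("health gate failed at %d%% - halt and roll back\n",
                   fast_canary[i].percent);
            return 1;
        }
    }
    return 0;
}
```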
One area that needs to be explored further is how the health signal of the canary test works in this scenario - especially since, as has been pointed out, with security updates the speed of rollout needed to respond to critical threats is a competing consideration.
If you roll out to 10% and the screens at LGA baggage claim go dark, not a lot of folks call maintenance to report it. If it happens at night, it would take time for IT to show up the next day, even longer to diagnose the problem and eventually identify CrowdStrike, and longer still to forward that information in a meaningful way (see the comment that in the Linux incident 'they requested more information').
Because the system crashed, it cannot call home to say the update failed. So the canary would actually need a positive heartbeat: a health report that calls home after the update and says 'updated and all good'. The absence of such heartbeats would then have to be interpreted as a failure. However, not all heartbeats will get through, so you have to account for false negatives.
In summary - yes, canary roll-out is a good method. But only if the monitoring is easy, timely, and reliable - including in infrastructure deployments, not just office worker deployments. That's a big wall to climb, and it seems to be higher than CrowdStrike was prepared for even before this incident.
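A minimal sketch of that positive-heartbeat idea (my own illustration, with made-up thresholds): after pushing to a canary cohort, count how many hosts call home with "updated and all good" within the reporting window, and only treat the cohort as healthy if the report rate clears a threshold that sits below 100% to absorb heartbeats lost for benign reasons.

```c
/* Decide whether a canary cohort looks healthy based on positive heartbeats
 * ("updated and all good") received within the reporting window. Crashed
 * machines cannot call home, so silence is the failure signal; the required
 * ratio sits below 100% to tolerate heartbeats lost for benign reasons
 * (offline laptops, flaky networks) - the false negatives mentioned above. */
static int cohort_looks_healthy(unsigned hosts_updated,
                                unsigned heartbeats_received,
                                double required_ratio /* e.g. 0.95 */) {
    if (hosts_updated == 0) {
        return 0;  /* nothing to judge yet; do not widen the rollout */
    }
    double ratio = (double)heartbeats_received / (double)hosts_updated;
    return ratio >= required_ratio;
}
```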
I would categorize this incident as a unique type of software supply chain "attack" (damage).