7 Comments
Jul 24 · Liked by Gergely Orosz

When joining Google I was surprised that there are hundreds of SWEs, or more, working on systems for progressive rollouts of binaries, feature flags, and canary analysis. After working at that scale, it's very satisfying to see in practice how important it is. Thousands of small changes target small partitions of traffic, and some are reverted automatically. It's unimaginable for that type of company not to have this in place.
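As a rough illustration of the shape such a system takes, here is a minimal sketch (not Google's actual tooling; the stage fractions, the error budget, and the `error_rate_at` metric hook are all made up for the example): each stage widens the traffic partition, and a regression at any stage triggers an automatic revert before the change reaches everyone.

```python
import random

# Hypothetical staged-rollout sketch (not Google's internal tooling): push a
# change to progressively larger traffic partitions and revert automatically
# if the canary metric regresses.
STAGES = [0.001, 0.01, 0.1, 0.5, 1.0]    # fraction of traffic per stage
ERROR_BUDGET = 0.02                      # maximum tolerated error rate

def error_rate_at(fraction: float) -> float:
    """Stand-in for real canary analysis; returns the observed error rate."""
    return random.uniform(0.0, 0.03)

def progressive_rollout() -> bool:
    for fraction in STAGES:
        observed = error_rate_at(fraction)
        if observed > ERROR_BUDGET:
            print(f"auto-revert: error rate {observed:.3f} at {fraction:.1%} of traffic")
            return False                 # the change never reaches 100%
        print(f"stage ok: {fraction:.1%} of traffic, error rate {observed:.3f}")
    return True

if __name__ == "__main__":
    progressive_rollout()
```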

Jul 23 · edited Jul 23 · Liked by Gergely Orosz

> The instruction that crashed is the Assembly instruction “mov r9d, [r8].” This instructs to move the bytes in the r9d address to the r8 one. The problem is that r8 is an unmapped address (invalid), and so the process crashes!

I think you have it backwards. I think `mov r9d, [r8]` is destination/source, not source/destination (as your text claims).
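For readers less used to Intel-syntax assembly, here is a minimal sketch of the corrected reading (plain Python with `ctypes`; the values are made up): the instruction loads the 32-bit value stored at the address held in r8 into the r9d register, so the crash happens on the read from an unmapped address.

```python
import ctypes

# Corrected reading of "mov r9d, [r8]" (Intel syntax: destination, source):
# load the 32-bit value stored AT the address held in r8 into register r9d.
value = ctypes.c_uint32(0x1234)               # pretend this is mapped memory
r8 = ctypes.addressof(value)                  # r8 holds an address
r9d = ctypes.c_uint32.from_address(r8).value  # the load: r9d <- [r8]
print(hex(r9d))                               # 0x1234

# If r8 instead held an unmapped address, the load itself would fault;
# inside a kernel driver that access violation is what triggers the crash.
```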

author

Hugh: you're right, thank you! I've both updated the text and re-read the MOV instruction parameter definition.


Excellent analysis, thanks.

That picture of the 120 laptops and technicians standing on ladders to access the machines reminded me of incident after incident through the past decades that required 'manual intervention'.

This is nothing new, people. And it will never end unless we change things up.

I remember arguing vociferously with the Microsoft developers in the Windows XP days that they needed to stop ceding control over driver development and driver system access levels (.sys files particularly) to third-party manufacturers--it was basically a terribly lazy move on Microsoft's part, driven entirely by profit motives.

From Windows NT to the present, I was involved in huge mass deployments (10K+ to 100K+ systems) of Windows systems and OS refreshes, covering just about all the major software and every damn piece of PC hardware out there, including stuff literally built in people's garages. Third-party manufacturers were doing a crap job of developing their drivers, causing multiple blue screens that were difficult to troubleshoot. Some of these failures turned out to be impossible to recover from, requiring complete wipe-and-reloads. This was before cloud storage, when smooth restoration of data couldn't be pulled off without a lot of additional effort.

Those crap drivers were not entirely the device manufacturers' fault, either. Windows was changing too rapidly for them to keep up. Communication between the mother ship and the literally thousands of device manufacturers--who were hardware oriented rather than software oriented, located in places like Taiwan and China where English wasn't their first language--was never handled particularly well. Information flowed inefficiently and was often conflicting and confusing, especially during crises. Updates to manufacturer drivers were spotty at best. But the little guys did what they could, as best they could.

Microsoft finally capitulated, realizing it was affecting their reputation (and good for them). They then did a partial job of fixing the issue, feeding new drivers in through Windows Update so there was a centralized distribution system, which helped a lot--but they still didn't do as thorough a job of actual oversight and serious in-depth testing of the drivers that impacted the system as they ought to have, in my opinion.

In effect, CrowdStrike is using a 'driver' to access key parts of the OS. That EU lawsuit is fascinating to me--a classic case of being careful what you wish for. It makes Apple's 'garden' look good by comparison, but only if the lines are clearly drawn and there is proper transparency about what core foundational functionality will be closely controlled, what will not, and what the entry points are. The EU attacked Apple even more bitterly, remember, for their closed system.

Ultimately, like cars and highways getting air bags and basic safety features, basic functionality needs to be protected, and that means regulations.

I realize developer culture tends to lean libertarian and wild west, so suggesting this is absolute anathema, but we reap what we sow, and lately we seem to do it over and over. The definition of insanity is doing the same thing over and over again and expecting different results. I do think we're making progress--but we're definitely not there yet.

I often had conversations with developers about deployment challenges through the years. In my deployments, I pulled developers, security folks, and the IT infrastructure team into the same room and let them go at it. They rarely talk to one another, much less coordinate.

In the end, you want your code out there safe and secure, so you can deploy rapidly and update frequently to correct the inevitable mistakes and counter the inevitable attacks. All of it should come with smooth communication right down to the endpoint about upcoming changes and WHY they're happening, so you don't freak out the user base and get flooded with complaints or, far worse, generate outright resistance to the update.


I'd be interested in an additional dimension to the advice above that handles the necessary speed of security updates. In security, if a vulnerability is found, closing it is a P0. For processes such as CrowdStrike's, slower rollouts are a harder proposition than slower rollouts of binaries. Time is far more of the essence.

As a specific point, canarying a denylist over 4 days for an active attack is an absolute no-go, very different from canarying a configuration of supported types. What do you do then? I was thinking you do a fast canary (0-1-100) with rigorous testing of the engine.
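For what it's worth, a compressed 0-1-100 schedule could look roughly like the sketch below (names such as `push_to_fraction` and `fleet_healthy` are hypothetical stand-ins, not anything CrowdStrike actually exposes): the only soak happens on the 1% slice, measured in minutes rather than days, and the gate is a hard stop.

```python
import time

# Hypothetical compressed canary for time-critical security content: 0 -> 1% -> 100%.
# `push_to_fraction` and `fleet_healthy` are stand-ins for real deployment and
# telemetry hooks.
FAST_STAGES = [0.01, 1.0]   # 1% canary, then everyone
SOAK_SECONDS = 15 * 60      # minutes of observation, not days, because the threat is active

def push_to_fraction(fraction: float) -> None:
    print(f"pushing content update to {fraction:.0%} of hosts")

def fleet_healthy() -> bool:
    # In reality: did the 1% of updated hosts keep reporting in and stay up?
    return True

def fast_canary() -> bool:
    for fraction in FAST_STAGES:
        push_to_fraction(fraction)
        if fraction < 1.0:
            time.sleep(SOAK_SECONDS)   # short soak on the canary slice only
            if not fleet_healthy():
                print("halt rollout and revert the canary slice")
                return False
    return True
```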

P.S. I'm very interested in the testing of the engine (not the validator), because theoretically this bug has been around for _months_.


The one area that needs to get explored further is how the health signal of the canary test works in this scenario - especially since, as has been pointed out, with security updates the speed of rollout in response to critical threats is a competing consideration.

If you roll out to 10% and the screens at the LGA baggage claim go dark, not a lot of folks call maintenance to tell them. If it happens at night, it would take some time for IT to show up the next day, even longer to diagnose the problem and eventually identify CrowdStrike, and then forward that info in a meaningful way (see the comment that in the Linux incident 'they requested more information').

Because the system has crashed, it cannot call home and report that it failed. So the canary would actually need a positive heartbeat health report that calls home after the update and says 'updated and all good'. Then the absence of such heartbeat signals would have to be interpreted as a failure. However, not all heartbeats will get through, so you have to account for false negatives.
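A sketch of that absence-of-heartbeat gate (the names, deadline, and threshold are hypothetical): silence past a deadline counts toward failure, with a small allowance for heartbeats lost to flaky networks.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical absence-of-heartbeat check: a crashed host cannot report its own
# failure, so the gate flags hosts that took the update but never sent the
# "updated and all good" heartbeat, while tolerating a small rate of lost
# heartbeats (false negatives).
HEARTBEAT_DEADLINE = timedelta(minutes=15)
MAX_SILENT_FRACTION = 0.02   # allow a little loss before declaring the canary failed

def canary_failed(updated_hosts: set[str],
                  confirmed_hosts: set[str],
                  pushed_at: datetime,
                  now: datetime) -> bool:
    if now - pushed_at < HEARTBEAT_DEADLINE:
        return False                              # too early to judge
    silent = updated_hosts - confirmed_hosts      # took the update, never phoned home
    return len(silent) / max(len(updated_hosts), 1) > MAX_SILENT_FRACTION

# Example: 3 of 100 canary hosts silent 20 minutes after the push -> failure.
pushed = datetime(2024, 7, 19, 4, 0, tzinfo=timezone.utc)
hosts = {f"host-{i}" for i in range(100)}
ok = {f"host-{i}" for i in range(97)}
print(canary_failed(hosts, ok, pushed, pushed + timedelta(minutes=20)))  # True
```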

In summary - yes, canary roll-out is a good method. But only if the monitoring is easy, timely, and reliable - including in infrastructure deployments, not just office worker deployments. That's a big wall to climb, and it seems to be higher than CrowdStrike was prepared for even before this incident.


I would categorize this incident as a unique type of software supply chain "attack" (damage).
