Last night my social media channels were jammed with posts about the CrowdStrike incident. In an unfortunate turn of events, an update pushed out to millions of Windows devices worldwide caused the dreaded Blue Screen of Death, taking countless systems offline.
Watching an event of this type unfold is a great lesson for us all. We get to see the best and worst in people. We see how the leaders of organisations respond and how the community reacts.
Responding to an incident well
IT outages are not uncommon. They happen frequently, with impact and blast radius ranging from very small to very large. In this case it was global and almost without comparison in size, but there have been similar issues in New Zealand in recent memory, just at a smaller scale. There are privacy breaches, ransomware threats, hardware failures and, of course, DNS!
Having an incident response plan is essential. Not just an IT response plan; it is a business response plan. The CEO needs to know what to do as much as the IT Operations team.
One of the first things I noticed about the CrowdStrike response was that it appeared to be technically focused. The CEO posted on X with technical detail about how they were responding but forgot something important. The reaction from many was “where is the apology” or “the word sorry is missing”. Now, to the engineers out there, those may seem like unnecessary words, but to customers they are very important. Several hours later the apology was posted, but by that point perception had already changed for many.
CrowdStrike have promised to be very open and report back on root cause. This is a good move: every person on the Internet is now a C++ expert who understands pointers and null data, and has at some point in their life built a kernel-level driver. Getting details out quickly helps reduce speculation and longer-term reputational damage. You can see CrowdStrike’s official response resources and updates on their website. Microsoft have also posted their response resources here.
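For anyone who hasn’t lived in C++ land, here is the general shape of the problem the armchair experts are speculating about. This is a purely illustrative sketch, not CrowdStrike’s actual code: reading through a null or invalid pointer is undefined behaviour, and while a user-mode program that does it just crashes, a kernel-mode driver making the same mistake takes the whole machine down with a bug check (the Blue Screen of Death).

```cpp
#include <cstdio>

// Hypothetical config structure, for illustration only.
struct Config {
    int threshold;
};

int readThreshold(const Config* config) {
    // Defensive check: without it, passing a null pointer would make the
    // line below read from invalid memory, which is undefined behaviour.
    if (config == nullptr) {
        return -1; // fail safely instead of dereferencing
    }
    return config->threshold;
}

int main() {
    Config good{42};
    std::printf("valid config: %d\n", readThreshold(&good));
    std::printf("null config:  %d\n", readThreshold(nullptr));
    return 0;
}
```

In user mode the operating system can contain a fault like this to one process; in kernel mode there is no such safety net, which is why a single faulty driver update can blue-screen millions of machines.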
There are many layers to this issue and, in the fullness of time, we will get a much clearer view of what actually happened and the factors that made it possible. There were also some factors that made recovery difficult, particularly if BitLocker was in use and the recovery keys weren’t accessible.
This isn’t the first and won’t be the last time we have an issue of this scale. Make sure your organisation talks about it and, most importantly, reviews your existing plan; if you don’t have one, create one. (Microsoft Incident Response resources)
BS on the Internet
As I said above, we also see the worst in people. There was the inevitable speculation that it was somehow caused by an intern or a “Diversity and Inclusion” hire. Anyone who works in a software team will know that this is complete Bull Shit (I think this is the first time I’ve used the term on my blog). Software of this type isn’t a one-person development effort. There are teams of people developing, reviewing, testing and approving releases to production. If you are allowing one person to do this without oversight and checks, then you have much deeper issues.
Frankly, the idea that you can hire people who never make mistakes is misguided at best. This is why we have to develop processes, tools and culture to reduce the risks and even then, we still need to prepare for the perfect team making an unfortunate mistake.
Thank you!
When these issues happen, IT people respond. They work long hours, often in difficult conditions, to try and resolve the problem. The pressure to fix things and get people working again can be immense. Please give them the space and resources they need to sort the issues out.
If your IT team has been working over the weekend to bring systems back to life, give them a fist bump and a KitKat (now that is excellent response marketing if I ever saw it). I am sure they will appreciate it.