It’s probably known to just about everyone in the world right now that on Friday, July 19, 2024, millions of computers went offline unexpectedly due to software provided by CrowdStrike, a vendor specializing in cybersecurity tools. Many have asked for a high-level explanation of what happened and why, so let’s dive into this topic. Settle in, this is going to be a long one.
Editor’s Note for Disclosure: While the author works for an organization which offers sales and services around CrowdStrike products, that organization also offers such sales and services for a wide variety of other EDR/XDR solutions. As such, objectivity can be preserved.
First, some background information:
CrowdStrike is a well-known and well-respected vendor in the cybersecurity space. They offer a large range of products and services to help businesses with everything from anti-malware defenses to forensic investigations after a cyber attack occurs. For the most part, their software works exceptionally well and their customers are typically happy with them as a company.
Endpoint Detection and Response (EDR) is the general term for any software that both scans for known malware files on a computer and watches what is actively running on that computer to determine whether it may be some form of yet-unknown malware. These operations are often referred to as “signature/heuristic scanning” and “behavioral detection,” respectively. While it isn’t necessary to understand the ins and outs of how this stuff works to understand what happened on Friday, note that CrowdStrike has a product line (Falcon XDR) which does both signature scanning and behavioral detection.
EDR solutions have two forms of updates that they regularly get delivered and installed. The first type is one most of us are familiar with, application updates. This is when a vendor needs to update the EDR software itself, much like how Windows receives patches and updates. In the case of an application update, it is the software itself being updated to a new version. These updates are infrequent, and only released when required to correct a software issue or deploy a new feature-set.
The second form of update is policy or definition updates (vendors use different terms for these, we will use “definition updates” for this article). Unlike application updates, definition updates do not change how the software works – they only change what the EDR knows to look for. As an example, every day there are new malicious files discovered in the world. So every day, EDR vendors prepare and send new definitions to allow their EDR to recognize and block these new threats. Definition updates happen multiple times per day for most vendors as new threat forms are discovered, analyzed, and quantified.
The other term that was heard a lot this weekend was “kernel mode.” This can be a bit complex, but it helps if you visualize your operating system (Windows, macOS, Linux, etc.) as a physical brick-and-mortar retail store. Most of what the store does happens in the front – customers buy things, clerks stock items, cash is received, credit cards are processed for payment. There are some things, like the counting of cash and the receiving of new stock, that are done in the back office because they are sensitive enough that extra control has to be enforced on them. In a computer operating system, user space is the front of the store, where the majority of things get done. Kernel space is the back office, where only restricted and sensitive operations occur. By their nature, EDR solutions run some processes in kernel space, since they require the ability to view, analyze, and control other software. While this allows an EDR to do what it does, it also means that errors which would not create major problems in user space can create truly massive problems in kernel space.
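The store analogy can also be sketched as a toy model in code. This is purely illustrative Python, not how any real operating system behaves; every name in it is invented for the example:

```python
# Toy model: why the same fault has a different blast radius in user
# space versus kernel space. All names here are illustrative only.

class SystemCrash(Exception):
    """Models a kernel panic / Windows stop error (BSOD)."""

def run_process(task, kernel_mode=False):
    """Run a task; faults are contained only if it runs in user space."""
    try:
        return task()
    except Exception as fault:
        if kernel_mode:
            # A fault in kernel space takes the whole system down.
            raise SystemCrash(f"system halted: {fault}")
        # A fault in user space kills just this one process.
        return f"process crashed: {fault}"

def bad_read():
    raise MemoryError("attempted to read an invalid address")

print(run_process(bad_read))               # contained: one app crashes
# run_process(bad_read, kernel_mode=True)  # raises SystemCrash instead
```

The point of the toy: the faulting code is identical in both calls; only where it runs determines whether one program dies or the whole machine does.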
OK, with all that taken care of… what happened on Friday?
Early in the morning (UTC), CrowdStrike pushed a definition update to all devices running their software on Windows operating systems. This is a process that happens many times a day, every day, and would not normally produce any kind of problems. After all, the definition update isn’t changing how the software works or anything like that. This update, however, had a flaw which set the stage for an absolute disaster.
Normally, any changes to software in an enterprise environment (like airlines, banks, etc.) would go through a process called a “staged rollout” – the update is tested in a computer lab, then rolled out to low-impact systems that won’t disrupt business if something goes wrong. Then, and only then, it goes out to all the other systems once the company is sure it won’t cause trouble. CrowdStrike application updates go through this staged rollout process like any other software update. Definition updates are not application updates, however, and because of both their frequency and the nature of their data (supplying new detection methods), they are not subject to staged rollout by the customer. In fact, customers rarely even have the ability to subject definition updates to staged rollouts themselves – the feature just doesn’t exist in nearly all EDR platforms. Several EDR vendors do stage definition rollouts across their customer base, but once a definition update is pushed to a given phase, it is installed immediately for every customer in that phase. CrowdStrike pushed this update out to over 8 million systems in a matter of minutes.
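The staged-rollout process described above can be sketched in a few lines of code. The ring names and percentages here are assumptions for illustration, not any vendor’s actual deployment policy:

```python
# Illustrative sketch of a staged ("ring") rollout. Each ring ships only
# after the previous one has run without problems; ring names and
# fractions are made up for the example.

ROLLOUT_RINGS = [
    ("lab",    0.01),  # internal test machines first
    ("canary", 0.05),  # low-impact production systems next
    ("broad",  1.00),  # everyone else, once earlier rings look healthy
]

def rollout_plan(machines, rings=ROLLOUT_RINGS):
    """Split a fleet into cumulative waves keyed by ring name."""
    plan, shipped = [], 0
    for name, fraction in rings:
        cutoff = int(len(machines) * fraction)
        plan.append((name, machines[shipped:cutoff]))
        shipped = cutoff
    return plan

fleet = [f"host-{i}" for i in range(1000)]
for ring, hosts in rollout_plan(fleet):
    print(ring, len(hosts))  # lab 10, canary 40, broad 950
```

The design point is simply that a bad update discovered in the “lab” or “canary” wave never reaches the remaining 95% of the fleet.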
This particular definition update had a massive issue. The update itself was improperly coded, which caused the software to attempt to read an area of memory that didn’t exist. In user space, this problem would just cause the application to crash, without any other impact on the system. In kernel space, however, an error of this type can crash the system itself, since in kernel space the “application” is – essentially – the operating system itself. This meant that every machine which attempted to apply the definition update (over 8.5 million at last count) crashed immediately.
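To make the “read an area of memory that didn’t exist” failure concrete, here is a heavily simplified Python sketch. This is not CrowdStrike’s actual code or file format; it just contrasts blindly dereferencing a field with bounds-checking it first:

```python
# Simplified illustration (not real driver code): reading a field from a
# definition file with and without validating the offset first.

def read_field_unsafe(buffer: bytes, offset: int) -> int:
    # No validation: a bad offset here is the user-space analogue of the
    # invalid kernel-mode memory read described above.
    return buffer[offset]

def read_field_safe(buffer: bytes, offset: int):
    # Validate before reading; reject malformed definition data instead
    # of faulting on it.
    if not 0 <= offset < len(buffer):
        return None  # e.g. log the bad definition and skip it
    return buffer[offset]

data = bytes(8)                      # an 8-byte "definition file"
print(read_field_safe(data, 100))    # None: malformed input rejected
# read_field_unsafe(data, 100)       # IndexError: the "crash" case
```

In user space the unsafe version just throws an exception; the whole point of the preceding paragraph is that the kernel-mode equivalent of that exception takes the operating system down with it.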
To recover from this issue, a machine would need to be booted into Safe Mode – a special function of Windows that starts the machine with the absolute bare minimum of components running: no third-party applications, no non-essential Windows applications and features, etc. Once booted into Safe Mode, the offending update file could be deleted and the machine rebooted to return to normal.
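Expressed as code, the “delete the offending file” step looks roughly like the sketch below. The directory and the `C-00000291*.sys` file pattern reflect CrowdStrike’s published workaround at the time, but treat them as illustrative and follow the vendor’s current guidance on any real system:

```python
# Sketch of the manual fix: from Safe Mode, delete the bad channel files.
# Path and pattern are from CrowdStrike's published guidance at the time;
# verify against current vendor instructions before using on a real machine.
from pathlib import Path

def remove_bad_definition(driver_dir: Path) -> list[str]:
    """Delete files matching the offending pattern; return their names."""
    removed = []
    for f in driver_dir.glob("C-00000291*.sys"):
        f.unlink()
        removed.append(f.name)
    return sorted(removed)

# On an affected machine (after booting into Safe Mode):
# remove_bad_definition(Path(r"C:\Windows\System32\drivers\CrowdStrike"))
```

A two-line script, in other words – the difficulty was never the fix itself, as the next section explains.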
So, why did it take days to make this happen if you just had to reboot into Safe Mode and delete a file? Well, there are two reasons this was a problem:
First, Safe Mode booting has to be done manually. On every single impacted device. When we may be talking about tens of thousands of devices in some companies, just the manpower needed to manually perform this process on every single machine is staggering.
Second, if the machine is using BitLocker (Microsoft’s disk encryption technology) – which they all absolutely should be using, and the majority were using – then a series of steps must be performed to unlock the disk that holds the Windows operating system before you can boot into Safe Mode and fix the problem. This series of steps is also very manual and time consuming, though in the days following the initial incident there were some methods discovered that could make it faster. Again, when applied against tens of thousands of devices, this will take a massive amount of people and time.
Combined, the requirement to manually boot into Safe Mode after performing the steps to unlock the drive led to company IT teams spending 72 hours and longer undoing this bad definition update across their organizations. All the while, critical systems required to run the business and serve customers were entirely offline. That led to the situation we saw this weekend, with airlines, stores, banks, and many other businesses unable to do anything but move through the recovery process as quickly as they could – and it still took a long time. This meant cancelled flights, no access to government and business services, slow-downs or worse in healthcare organizations, etc. These operations slowly started coming back online over the weekend, with more still being fixed as I write this on Monday.
Now that we’ve got a good handle on what went wrong, let’s answer some other common questions:
“Was this a cyber attack?” No. This was a cyber incident, but there is no evidence that it was an attack. Incidents are anything that causes an impact to a person or business, and this definitely qualifies. Attacks are purposeful, malicious actions against a person or business, and this doesn’t qualify as that. While the possibility that this was threat activity cannot yet be entirely ruled out, there are no indications that any threat actor was part of this situation. No group claimed responsibility, no ransom was demanded, no data was stolen. The incident also was not targeted: the impacted systems were simply those online when the bad update became available, making the impact effectively random. This view may change in future as more details become available, but as of today this does not appear to be an attack.
“Why did CrowdStrike push out an update on a Friday, when there would be fewer people available to fix it?” The short answer is that definition updates are pushed several times a day, every day. This wasn’t something that was purposely pushed on a Friday specifically; it was just bad luck that the first update for Friday AM had the error in it.
“How did CrowdStrike not know this would happen? Didn’t they test the update?” We don’t know just yet. While we now know what happened, we do not yet have all the details on how it happened. It would be expected that such information will be disclosed or otherwise come to light in the coming weeks.
“Why was only Windows impacted?” Definition updates for Windows, MacOS, and Linux are created, managed, and delivered through different channels. That is something that is common for most EDR vendors. This update was only for Windows, so only Windows systems were impacted.
“Was this a Microsoft issue?” Yes and no, but in every important way no. It was not actually Microsoft’s error, but since it only impacted Windows systems it was a Microsoft problem. Microsoft was not responsible for causing the problem, or responsible for fixing it, though they did offer whatever support and tools they could to help, and continue to do so.
“Couldn’t companies test the update before it rolled out?” No, not in this case. The ability to stage the rollout of definition updates is not generally available in EDR solutions (CrowdStrike or other vendors) – though after this weekend, that might be changing. There are very real reasons why such features aren’t available, but with the issues we just went through, it might be time to change that policy.
“How can we stop this from ever happening again?” The good news is that many EDR vendors stage the rollout of definition updates across their customers. So while a customer cannot stage the rollouts themselves, at least only a limited number of customers will be impacted by a bad update. No doubt CrowdStrike will be implementing this policy in the very near future. The nature and urgency of definition updates make traditional staging methods unusable, as organizations cannot delay these updates for weeks the way they do with Windows updates and other application updates. That being said, some method of automated staging of definition updates to specific groups of machines – while truly not optimal – might be necessary in future.
To sum up, CrowdStrike put out a definition update with an error in it, and because this definition update was loaded into a kernel-mode process, it crashed Windows. Over 8.5 million such Windows machines downloaded and applied the update before the error was discovered, leaving thousands of businesses unable to operate until the situation was corrected. That correction required manual and time-consuming operations to be performed machine by machine, so the process took (and continues to take) a significant amount of time. No data theft or destruction occurred (beyond what would normally happen during a Windows crash), no ransom was demanded, and no party other than CrowdStrike has been identified as responsible. As such, it is highly unlikely that this was any form of cyber attack, but it was definitely a cyber incident, since a huge chunk of the business world went offline.