Cybersecurity in Plain English: What happened with CrowdStrike?

It’s probably known to just about everyone in the world right now that on Friday, July 19 2024, millions of computers went offline unexpectedly Noun crash 4437101 C462DD.due to the software provided by CrowdStrike – a vendor specializing in cybersecurity tools. Many have asked for an explanation at a high level as to what happened an why, so let’s dive into this topic. Settle in, this is going to be a long one. 

Editor’s Note for Disclosure: While the author works for an organization which offers sales and services around CrowdStrike products, they also offer such sales and services for a wide variety of other EDR/XDR solutions. As such, objectivity can be preserved.

First, some background information:

 

CrowdStrike is a well-known and well-respected vendor in the cybersecurity space. They offer a large range of products and services to help businesses with everything from anti-malware defenses to forensic investigations after a cyber attack occurs. For the most part, their software works exceptionally well and their customers are typically happy with them as a company. 

Endpoint Detection and Response (EDR) is the general term for any software that looks both for known malware files on a computer and also looks at what things are actively running on a computer to attempt to determine if they may be some form of yet-unknown malware. These operations are often referred to as “signature/heuristic scanning” and “behavioral detection” respectively. While it isn’t necessary to understand the ins and outs of how this stuff works to understand what happened on Friday, CrowdStrike has a product line (Falcon XDR) which does both signature scanning and behavioral detection. 

EDR solutions have two forms of updates that they regularly get delivered and installed. The first type is one most of us are familiar with, application updates. This is when a vendor needs to update the EDR software itself, much like how Windows receives patches and updates. In the case of an application update, it is the software itself being updated to a new version. These updates are infrequent, and only released when required to correct a software issue or deploy a new feature-set. 

 

The second form of update is policy or definition updates (vendors use different terms for these, we will use “definition updates” for this article). Unlike application updates, definition updates do not change how the software works – they only change what the EDR knows to look for. As an example, every day there are new malicious files discovered in the world. So every day, EDR vendors prepare and send new definitions to allow their EDR to recognize and block these new threats. Definition updates happen multiple times per day for most vendors as new threat forms are discovered, analyzed, and quantified. 

The other term that was heard a lot this weekend was “kernel mode.” This can be a bit complex, but it helps if you visualize your operating system (Windows, MacOS, Linux, etc.) as a physical brick-and-mortar retail store. Most of what the store does happens in the front – customers buy things, clerks stock items, cash is received, credit cards are processed for payment. There are some things, like the counting of cash and the receiving of new stock, that are done in the back office because they are sensitive enough that extra control has to be enforced on them. In a computer operating system, user space is the front of the store where the majority of things get done. Kernel space is the back office, where only restricted and sensitive operations occur. By their nature, EDR solutions run some processes in the kernel space; since they require the ability to view, analyze, and control other software. While this allows an EDR to do what it does, it also means that errors or issues that would not create major problems if they were running in user space can create truly massive problems as they are actually running in kernel space. 

OK, with all that taken care of… what happened on Friday?

Early in the morning (UST), CrowdStrike pushed a definition update to all devices running their software on Windows operating systems. This is a process that happens many times a day, every day, and would not normally produce any kind of problems. After all, the definition update isn’t changing how the software works or anything like that. This update, however, had a flaw which set the stage for an absolute disaster. 

Normally, any changes to software in an enterprise environment (like airlines, banks, etc.) would go through a process called a “staged rollout” – the update is tested in a computer lab, then rolled out to low-impact systems that won’t disrupt business if something goes wrong. Then, and only then, it goes out to all the other systems once the company is sure that it won’t cause trouble. For CrowdStrike application updates, this process happens like any other software update, and they are put through a staged rollout process. However, definition updates are not application updates, and because of both the frequency of definition updates and the nature of their data (supplying new detection methods), they are not subject to staged rollout by the customer. In fact, customers rarely have even the ability to subject definition updates to staged rollouts themselves – the feature just doesn’t exist in nearly all EDR platforms. There are several EDR vendors who do staged rollouts to their customers, but once a definition update is pushed, it is installed immediately for every customer in that phase of the rollout. CrowdStrike pushed this update out to over 8 million systems in a matter of minutes.

This particular definition update had a massive issue. The update itself was improperly coded, which made it attempt to read an area of memory that couldn’t exist. In user space, this problem would just cause the application to crash, but not have any other impact on the system. In kernel space, however, an error of this type can cause the system itself to crash, since in kernel space the “application” is – essentially – the operating system itself. This meant that every machine which attempted to apply the definition update (over 8.5 million at last count) crashed immediately.

To recover from this issue, a machine would need to be booted into Safe Mode – a special function of Windows operating systems that boots up the machine with the absolute bare minimum of stuff running. No 3rd-party applications, no non-essential Windows applications and features, etc. Once booted into Safe Mode, the offending update file could be deleted and the machine rebooted to return to normal. 

So, why did it take days to make this happen if you just had to reboot into Safe Mode and delete a file? Well, there are two reasons this was a problem:

First, Safe Mode booting has to be done manually. On every single impacted device. When we may be talking about tens of thousands of devices in some companies, just the manpower needed to manually perform this process on every single machine is staggering. 

Second, if the machine is using BitLocker (Microsoft’s disk encryption technology) – which they all absolutely should be using, and the majority were using – then a series of steps must be performed to unlock the disk that holds the Windows operating system before you can boot into Safe Mode and fix the problem. This series of steps is also very manual and time consuming, though in the days following the initial incident there were some methods discovered that could make it faster. Again, when applied against tens of thousands of devices, this will take a massive amount of people and time. 

Combined, the requirement to manually boot into Safe Mode after performing the steps to unlock the drive led to company IT teams spending 72 hours and longer un-doing this bad definition update across their organizations. All the while, critical systems which are required to run the business and service customers were offline entirely. That led to the situation we saw this weekend, with airlines, stores, banks, and lots of other businesses being unable to do anything but move through the recovery process as quickly as they could – but it was still taking a long time to accomplish. Of course this led to cancelled flights, no access to government and business services, slow-downs or worse in healthcare organizations, etc. These operations slowly started coming back online over the weekend, with more still being fixed as I write this on Monday.

Now that we’ve got a good handle on what went wrong, let’s answer some other common questions:

“Was this a cyber attack?” No. This was a cyber incident, but does not show any evidence that it was an attack. Incidents are anything that causes an impact to a person or business, and this definitely qualifies. Attacks are purposeful, malicious actions against a person or business, and this doesn’t qualify as that. While the potential that this was threat activity can not yet be entirely ruled out, there are no indications that any threat actor was part of this situation. No group claimed responsibility, no ransom was demanded, no data was stolen. The incident also was not targeted, systems that were impacted were just online when the bad update was available, and therefore it was pretty random. This view may change in future as more details become available, but as of today this does not appear to be an attack. 

“Why did CrowdStrike push out an update on a Friday, when there would be less people available to fix it?” The short answer is that definition updates are pushed several times a day, every day. This wasn’t something that was purposely pushed on on Friday specifically, it was just bad luck that the first update for Friday AM had the error in it. 

“How did CrowdStrike not know this would happen? Didn’t they test the update?” We don’t know just yet. While we now know what happened, we do not yet have all the details on how it happened. It would be expected that such information will be disclosed or otherwise come to light in the coming weeks. 

“Why was only Windows impacted?” Definition updates for Windows, MacOS, and Linux are created, managed, and delivered through different channels. That is something that is common for most EDR vendors. This update was only for Windows, so only Windows systems were impacted.

“Was this a Microsoft issue?” Yes and no, but in every important way no. It was not actually Microsoft’s error, but since it only impacted Windows systems it was a Microsoft problem. Microsoft was not responsible for causing the problem, or responsible for fixing it, though they did offer whatever support and tools they could to help, and continue to do so. 

“Couldn’t companies test the update before it rolled out?” No, not in this case. The ability to stage the rollout of definition updates is not generally available in EDR solutions (CrowdStrike or other vendors) – though after this weekend, that might be changing. There are very real reasons why such features aren’t available, but with the issues we just went through, it might be time to change that policy. 

“How can we stop this from ever happening again?” The good news is that many EDR vendors stage the rollout of definition updates across their customers. So while a customer cannot stage the rollouts themselves, at least only a limited number of customers will be impacted by a bad update. No doubt CrowdStrike will be implementing this policy in the very near future. The nature and urgency of definition updates makes traditional staging methods unusable as organizations cannot delay updates for weeks as they do with Windows updates and other application updates. That being said, some method of automated staging of definition updates to specific groups of machines – while truly not optimal – might be necessary in future.

To sum up, CrowdStrike put out a definition update with an error in it, and because this definition update was loaded into a kernel-mode process, it crashed Windows. Over 8.5 million such Windows machines downloaded and applied the update before the error was discovered, causing thousands of businesses to be unable to operate until the situation was corrected. That correction required manual and time-consuming operations to be performed machine by machine, so the process took (and continues to take) a significant amount of time. No data theft or destruction occurred (beyond what would normally happen during a Windows crash), no ransom demanded, no responsibility beyond CrowdStrike claimed. As such, it is highly unlikely that this was any form of cyber attack; but it was definitely a cyber incident since a huge chunk of the business world went offline. 

Cybersecurity in Plain English: A Special Snowflake Disaster

Editor’s Note: This is an emergent story, and as such there may be more information available after the date of publication. 

Many readers have been asking: “What happened with Snowflake, and why is it making the news?” Let’s dive into this situation, as it is a little more complex than many other large-scale attacks we’ve seen recently.Noun abstract geometric snowflake 2143460 FF001C.

Snowflake is a data management and analytics service provider. What that essentially means is that when companies need to store, manage, and perform intelligence operations on massive amounts of data; Snowflake is one of the larger vendors that has services that allow that to happen. According to SoCRadar [[ https://socradar.io/overview-of-the-snowflake-breach/ ]], in late-May of 2024 Snowflake acknowledged that unusual activity had been observed across their platform since mid-April. While the activity indicated that something wasn’t right, the investigation didn’t find any threat activity being run against Snowflake’s systems directly. This was a bit of a confusing period, as usually you would see evidence that the vendor’s own systems were being attacked when you had strange activity going on across the vendor’s networks. 

Around the time of that disclosure, Santander Bank and Ticketmaster both reported that their data had been stolen, and was being held ransom by a threat actor. These are two enormous companies, and both reporting data breach activity within days of each other is an event that doesn’t happen often. Sure enough, when both companies investigated independently, they both came to the same conclusion – their data in Snowflake was what had been stolen. Many additional disclosures by both victim companies and the threat actors themselves – a group identified as UNC5537 by Mandiant [[ https://cloud.google.com/blog/topics/threat-intelligence/unc5537-snowflake-data-theft-extortion ]] occurred over the following weeks. Most recently, AT&T disclosed that they had suffered a massive breach of their data, with over 7 million customers impacted [[ https://about.att.com/story/2024/addressing-data-set-released-on-dark-web.html ]].

So, was Snowflake compromised? Not exactly. What happened her was that Snowflake did not require that customers use Multi-Factor Authentication (MFA) for users logging into the Snowflake platform. This allowed attackers who were able to successfully get malware on user desktops/devices to grab credentials; and then use those credentials to access and steal that customer’s data in Snowflake. This was primarily done by tricking a user into installing/running an “infostealer” malware, which allowed the attacker to see keystrokes, grab saved credentials, snoop on connections, etc. All the attacker needed to do was infect one machine that was being used by an authorized Snowflake user, and they could then get access to all the data that customer stored in Snowflake. Techniques like the use of password vaults (so there would be no keystrokes to spy on) and the use of MFA (which would require the user acknowledge a push alert or get a code on a different device) would be good defenses against this kind of attack, but Snowflake didn’t require these techniques to be in use for their customers.

Snowflake did not – at least technically – do anything wrong. They allow customers to use MFA and other login/credential security with their service, they just didn’t mandate it. They also did not have a quick way to turn on the requirement for MFA throughout a customer organization if that customer hadn’t started out making it mandatory for all Snowflake accounts they created. This is a point of contention with the cybersecurity community, but even though it is a violation of best practices it is not something that Snowflake purposely did incorrectly. Because of this, the attacks being seen are not the direct fault of Snowflake, but rather a result of Snowflake not forcing customers to use available security measures. Keep in mind that Snowflake has been around for some time now. When they first started, MFA was not an industry standard and customers starting to work with Snowflake back then were unlikely to have enabled it. 

Snowflake themselves have taken steps to address the issue. Most notably, they implemented a setting in their customer administration panel that lets an organization force the use of MFA for everyone in that company. If any users were not set up for MFA, they would need to configure it the next time they logged in. This is a good step in the right direction, but Snowflake did make a few significant errors in the way they handled the situation overall:

 – Snowflake did not enforce cybersecurity best practices by default, even for new customers. While they have been around long enough that their earlier customers may have started using the service before MFA was a standard security control, not getting those legacy customers to enable MFA was definitely a mistake. 

 – They also immediately tried to shift blame to customers who had suffered breaches. The customers in question were responsible for not implementing MFA and/or other security controls to safeguard their data; but attempting to blame the victim rarely works out in an vendor’s favor. In this case, the backlash from the security community was immediate and vocal. Especially when it came to light that there was no easy way to enable MFA across an entire customer, they lost the high ground quickly. 

 – That brings us to the next issue Snowflake faced: they didn’t make it easy to enable MFA. Most vendors these days allow for a quick way to enforce MFA across all users at that customer; with many vendors now having it be opt-out; meaning customer users will have to use MFA unless the customer organization opts-out of that feature. MFA was opt-in for Snowflake customers, even those signing up more recently when the use of MFA was considered a best practice by the cybersecurity community at large. With no quick switch or toggle to change that, many customers found themselves scrambling to identify each user of Snowflake within their organization and turn MFA on for each, one by one. 

Snowflake, in the end, was not responsible for the breaches multiple customers fell victim to. While that is true; their handling of the situation, attempt to blame the victims loudly and immediately, and lack of a critical feature set (to enforce MFA customer-wide) has created a situation where they are seen as at-fault, even when they’re not. A great case-study for other service providers who may want to plan for potential negative press events before they end up having to deal with them. 

If you are a Snowflake customer, you should immediately locate and then enable the switch to enforce MFA on all user accounts. Your users can utilize either Microsoft or Google Authenticator apps, or whatever Single Sign-On/IAM systems your organization uses. 

Is Ransomware Getting Worse, or Does it Just Feel That Way?

A reader contributed a great question recently: “So many more ransomware attacks are getting talked about in the news. Is ransomware growing Noun broadcast 6870591 C462DD.that quickly, or does it just seem worse than it is?” The answer is “both,” but let’s break things down.

 

According to Security Magazine, ransomware has indeed grown exponentially in the last year, with an 81% increase in attack activity. That’s certainly not good, but may not be telling the whole story. While there’s no doubt that threat actors have increased attacks via Ransomware-as-a-Service (RaaS) and more sophisticated automation; some of what we’re seeing is an increase in the number of reported attacks compared to previous years.  

 

Better automation allows threat actors to perform more attack attempts in the same amount of time than they’d be able to perform manually. Scripting and automation have increased the effectiveness of legitimate organizations in many different ways. Processes like allowing a user access to an application which would have previously taken days or a week can now be done in seconds – safely. Stocks trades that would take hours in years past are now done in seconds – also safely, usually. As legitimate businesses have embraced automation to make their organizations better, threat actors have done the same. Now, a new exploit that would allow for a new attack, which would normally take weeks or months to see significant spread throughout the world, can become a major world-wide threat in hours. This, of course, means that more attack attempts leads to more successful attacks and higher numbers of organizations compromised year over year. 

 

RaaS allows established threat actor cartels to re-package and sell attack protocols they no longer use themselves to lower-tier threat actors. This extends the life of the product (the ransomware attack), and allows the cartel to continue to make money from it for much longer periods of time. By having more threat actors use existing tools against still-unpatched systems, more organizations end up compromised.

 

Both of these factors have lead to a marked increase in the total number of ransomware victim organizations over time, and that can’t be dismissed as a statistical blip or anything like that. We’re facing more attacks, more often, across more industries.

 

However, it should be noted that a huge portion of the compromised organizations would not – until recently – have reported the compromise at all. Businesses have many reasons to attempt to hide the fact that they fell victim to a ransomware attack. Loss of customer trust, violation of clauses in contracts, endangering future business – all reasons companies may choose to hide that an attack took place. This isn’t new behavior, as companies would often try to gloss over or bury anything that could impact their bottom line as you would expect – we’re just now talking about impacts caused by digital disasters instead of bad accounting practices, corporate espionage, and other more traditional events. 

 

Generally, if such hidden events and setbacks would cause overall market impact or jeopardize citizens of a country or locality; government agencies create regulation to make it mandatory to report it. This is not something that’s done frequently, and only occurs when the burying of such events would create major fallout in an entire market or a large group of citizens. Typically, new regulations only occur after such a major impact occurs. Over the last several years, the impact of cybersecurity incidents has indeed begun to cause fallout in markets, and has caused impact to massive amounts of citizens through identity theft and other problems. Because of this, governments have begun to pass legislation that makes it mandatory to quickly disclose any cybersecurity incident which might have a “material impact” to markets and/or consumers. You can read more about one such regulation in a previous post here.

 

In the USA, both the Federal Government (specifically the Securities and Exchange Commission) and several State Governments (most notably New York and California) have already passed regulations which compel organizations to report incidents via public filings. The SEC, for example, requires the filing of an amendment to a regular reporting form (8-K) within four days of any incident that has material impact, and the incident must also be part of the annual 10-K filing every public company and certain other companies must file. Since these reports are public, anyone and everyone can view them. Other US states either have regulations that are being/have been amended to cover cybersecurity incidents, or are creating new legislation to make disclosure mandatory for any companies that do business within that state or territory. The European Union and other nations/coalitions are also either strengthening reporting regulations or implementing new regulations specifically around cybersecurity incident reporting.

 

The practical upshot of this is that significantly more incidents are becoming public knowledge that would not have been publicly reported previously. Incidents that would have been “swept under the rug” in previous years are now becoming public knowledge quickly, leading to a marked uptick in the number of known attack victim organizations. While this number is certainly not enough to account for the total increase in attacks, it has most definitely increased the number of reported attacks over the last few years. The combination has lead to massive increases in year-over-year ransomware reports, leading to dramatic news reporting on the problem. As the issue becomes more sensational, everyone hears about it more often and with more volume.

 

So, while it is true that the total number of ransomware attacks has increased sharply due to a combination of the rise Ransomware-as-a-Service and the use of automation in threat actor activities, it is important to also realize some of the sensational numbers are attributable to companies being required to talk about the problem more than in the past. In total, the issue of ransomware and other cybercrime is taking a much bigger share of the public interest – which is a very good thing – but we must look at all of the factors that lead to such numbers to more fully understand what’s going on.