The idea that a single IT misstep could cripple companies across entire industries might once have seemed like an overstatement. However, the recent global Windows outage is a stark reminder of how interconnected our world is. On July 19th, 2024, a faulty security update from CrowdStrike wreaked havoc on Microsoft Windows systems worldwide. How could such an IT catastrophe unfold? Let’s dive in and explore the causes.
What happened?
Many high-profile companies, such as JP Morgan Chase, Walmart, and Shell, use Falcon, CrowdStrike’s cybersecurity software, to protect their IT infrastructure from data breaches. In fact, it’s used by 82 percent of US state governments and 48 percent of the largest US cities.
Unlike traditional security systems that require bulky hardware and constant signature downloads, CrowdStrike Falcon is cloud-native. It works through a lightweight agent installed on user devices, be it Windows, Mac, or Linux. Once installed, this agent connects seamlessly to CrowdStrike’s cloud platform.
CrowdStrike’s latest software update for Windows users turned out to be faulty, causing a Blue Screen of Death (BSOD) at boot. Since the system has to boot before the update can be rolled back, this was a dead-end scenario for a non-technical user.
Adding to the confusion, an outage hit Microsoft Azure services and the Microsoft 365 suite of apps in the central US earlier on Thursday. While a company spokesperson clarified that these were separate issues (one impacting Azure, the other CrowdStrike), cybersecurity consultant Thomas Parenty (a former National Security Agency analyst) offered a different perspective. He suggested a possible link: “The systems required for the connection to Azure could have been initially impacted by the CrowdStrike issue, rendering the service unavailable.”
Talk about a chaotic end to the week!
What are the consequences?
The widespread impact of this incident is staggering, considering the CrowdStrike agent is installed on millions of devices – from servers and personal computers to Internet of Things (IoT) devices. The update, intended to enhance system security, ironically caused widespread crashes across various industries, including:
Transportation: Airports around the world faced flight delays and cancellations as critical systems malfunctioned.
Healthcare: Appointment booking systems and patient record access were hindered at hospitals and clinics.
Finance: Banks and other financial institutions experienced technical difficulties.
Media: Television stations, like Sky News, were temporarily knocked off the air.
Why did this happen?
Firstly, most organizations deploy software updates automatically, so the rogue update spread like wildfire. Secondly, the culprit was poorly written code – an error that CrowdStrike has since taken full responsibility for. While the exact details of the blunder remain unclear, one thing is certain: rigorous software testing could have prevented this IT disaster entirely, or at least significantly reduced its impact.
Why the update might have caused issues:
Insufficient Testing: CrowdStrike’s QA process for the update might not have been thorough enough. Software as critical as Falcon should be tested on dozens of device types and in hundreds of different environments.
Testing Environment Limitations: CrowdStrike’s testing environment might not have perfectly replicated real-world conditions. This could lead to issues showing up only when the update interacts with other software on user machines.
Regression Testing Overlooked: It’s possible that regression testing, which ensures new updates don’t break existing functionalities, might not have been conducted properly. This could explain why a security update triggered system crashes.
Human Factor: We are only human, after all. And it’s a universally agreed-upon IT rule not to release anything on Fridays. Perhaps it was a time-sensitive update, and people tend to make the silliest mistakes when pressed for time.
What can Windows users do?
The good news is that CrowdStrike engineers shared a workaround. Here it is:
- Boot Windows into Safe Mode or the Windows Recovery Environment
- Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
- Locate the file matching “C-00000291*.sys”, and delete it
- Boot the host normally
The bad news is that it doesn’t work for everyone. First, if you’re not a techie, half of these steps may read like gibberish. Also, the manual fix can’t be applied remotely or to cloud-based systems, requiring physical access to each impacted device. Unfortunately, this translates into a lengthy recovery process for system administrators.
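For administrators walking from machine to machine, the file-deletion step of the workaround above could be scripted rather than done by hand. Below is a minimal Python sketch of that idea; the helper name is ours, and deleting from the real `C:\Windows\System32\drivers\CrowdStrike` path still requires Safe Mode and administrator rights, exactly as in the official steps.

```python
from pathlib import Path

# Default location of CrowdStrike channel files on Windows,
# per the published workaround.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_files(driver_dir: Path = DEFAULT_DIR) -> list[str]:
    """Delete files matching the faulty update's "C-00000291*.sys" pattern
    and return the names of the files that were removed."""
    removed = []
    for f in driver_dir.glob("C-00000291*.sys"):
        f.unlink()  # needs admin privileges on the real directory
        removed.append(f.name)
    return removed
```

The directory is a parameter only so the logic can be tried safely on a scratch folder before touching a production host.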
Prevention is better than cure
So, what lessons can we learn from one of the most widespread tech meltdowns? Prevention is always better than cure. While having a detailed incident response plan is good, what’s even better is having an ongoing and well-established quality assurance process.
Prevent Faulty Updates With These Testing Types:
Unit Testing: This involves testing individual components of the update in isolation. This could have identified issues within the update code itself before it interacted with other software.
Functional Testing: This verifies if the update delivers its intended functionality without causing unintended consequences. This could have involved testing the update on various Windows configurations and with different software combinations.
Integration Testing: This focuses on how the update interacts with other software on a system. This could have revealed compatibility issues with specific Windows versions or drivers.
Regression Testing: This ensures the update doesn’t introduce new vulnerabilities or crashes.
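To make the first of these concrete, here is a minimal sketch of a unit test that validates an update payload in isolation before it ships. The validator, the `CSUPDATE` magic header, and the file format are all invented for illustration – they are not CrowdStrike’s actual format – but the pattern is the point: even a trivial check like this rejects an empty or header-less file before it ever reaches a customer machine.

```python
# Hypothetical magic header for the example file format.
MAGIC = b"CSUPDATE"

def validate_channel_file(data: bytes) -> bool:
    """Reject empty, truncated, or header-less update payloads."""
    if len(data) < len(MAGIC) + 1:  # too short to hold header + payload
        return False
    if not data.startswith(MAGIC):  # wrong or missing header
        return False
    return True

# Unit tests exercising the validator in isolation:
def test_valid_file_passes():
    assert validate_channel_file(MAGIC + b"\x01payload")

def test_empty_file_fails():
    assert not validate_channel_file(b"")

def test_zeroed_file_fails():
    assert not validate_channel_file(b"\x00" * 1024)
```

In a real pipeline these tests would run automatically on every build, so a malformed update file fails CI instead of failing in the field.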
Strengthen Your Infrastructure and Processes:
Canary Releases: Deploy new updates to a small subset of users first to identify any critical issues before a full rollout.
Feature Flags: Isolate new features behind feature flags to enable quick rollback if necessary.
Immutable Infrastructure: Use immutable infrastructure to reduce the risk of configuration drift and make rollbacks easier.
Chaos Engineering: Intentionally introduce failures into the system to identify vulnerabilities and improve resilience.
Post-Mortem Analysis: Conduct detailed post-mortem analyses after incidents to identify root causes and implement preventive measures.
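The canary-release idea above can be sketched in a few lines. This is a simplified illustration with invented names, not any vendor’s rollout system: each device is deterministically bucketed by hashing its ID, so the same small cohort always receives an update first, and the rest of the fleet is held back until the canaries look healthy.

```python
import hashlib

def in_canary(device_id: str, percent: int) -> bool:
    """Deterministically place a device in one of 100 buckets and
    return True if its bucket falls inside the canary percentage."""
    bucket = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def rollout(device_ids, percent=5):
    """Split a fleet into a canary group (gets the update now)
    and a holdback group (waits for canary results)."""
    canary = [d for d in device_ids if in_canary(d, percent)]
    holdback = [d for d in device_ids if not in_canary(d, percent)]
    return canary, holdback
```

Because the bucketing is a pure function of the device ID, re-running the rollout yields the same split, which makes monitoring and rollback decisions reproducible.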
Improve Communication and Collaboration:
Incident Response Playbook: Create a detailed incident response playbook that outlines roles, responsibilities, and procedures for handling outages.
Cross-Functional Teams: Foster collaboration between development, QA, operations, and security teams to ensure a holistic approach to software development.
Regular Security Audits: Conduct regular security audits to identify and address vulnerabilities.