CrowdStrike Outage: Lessons Learned in Controls & Resiliency

By Maggie Cheney | Published on August 7, 2024

The recent CrowdStrike outage, which caused widespread system crashes and disruptions, served as an important reminder of the interconnectedness and fragility of our world as it relates to technology. While the incident was disruptive and many of our clients can attest to the headaches it caused, it also provided valuable insight into how organizations can enhance their controls to protect against future outages.

What Caused the CrowdStrike Failure?

The primary driver of the outage was a faulty update to CrowdStrike’s Falcon Sensor software. The update contained an undetected error that caused a crash when the sensor processed it, resulting in a Blue Screen of Death (BSOD) on affected Windows systems. Millions of computers running the Microsoft Windows operating system were impacted. The widespread use of CrowdStrike’s security software, including on systems hosted on major cloud platforms like Microsoft Azure, amplified the effects of the outage. This led to disruptions across various industries, including airlines, banking, media, and government.

Why Is the CrowdStrike Outage Important?

The CrowdStrike outage was a clear reminder to organizations of the importance of prioritizing cybersecurity, internal controls, and business resilience. Given the scale of this outage (many believe it to be the largest IT outage in history), there are lessons to be learned. By spending time analyzing these lessons, businesses can reduce their risk of experiencing similar disruptions and better protect their operations.

It’s important to remember that cybersecurity is a continuous endeavor, not a point-in-time checklist. Many of my recent conversations with my clients have centered around the concept of continuous controls monitoring. Continuously monitoring and evaluating controls, and implementing real-time correction strategies when errors are identified, is essential to maintaining a strong security posture.

This doesn’t just include implementing continuous compliance tools, though such tools can be helpful. It means that organizations need to prioritize security and compliance through investment, engage management and the board of directors in the risk management process, and build a culture of internal control throughout the organization to better protect the company’s critical systems and data.

Key Lessons Learned

As I said, there are many lessons to be learned when IT outages and other security incidents arise. Organizations would be wise to perform their own analysis of the root causes and the remedial actions taken by CrowdStrike and Microsoft as a result of this outage, whether or not they were directly impacted by it. The CrowdStrike outage also highlighted broader takeaways, such as the need for strong processes and controls that span an organization. These processes and controls include the following.

Third-Party Risk Management

The outage underscored the need for comprehensive third-party risk assessment and management. Organizations must have a clear understanding of their dependencies and develop contingency plans for when critical vendors experience disruptions. Additionally, organizations need to understand the services each vendor provides, as well as the controls they will need to have in place to support those services and complement their control environment.
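As a rough illustration of the dependency review described above, the following sketch flags critical vendors that have no documented contingency plan. The inventory, field names, and the `critical_without_plan` helper are hypothetical; a real inventory would come from your vendor management records.

```python
# Hypothetical vendor inventory; names and fields are illustrative assumptions.
vendor_inventory = [
    {"name": "endpoint security vendor", "critical": True, "contingency_plan": False},
    {"name": "cloud hosting provider", "critical": True, "contingency_plan": True},
    {"name": "office supplies portal", "critical": False, "contingency_plan": False},
]


def critical_without_plan(inventory):
    """Return the names of critical vendors that lack a contingency plan."""
    return [
        vendor["name"]
        for vendor in inventory
        if vendor["critical"] and not vendor["contingency_plan"]
    ]


for name in critical_without_plan(vendor_inventory):
    print(f"No contingency plan for critical vendor: {name}")
```

A list like this is only a starting point; each flagged vendor still needs an owner and an agreed fallback.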

Incident Response and Communication

Effective incident response and communication are essential for mitigating the impact of outages. Clear and timely communication with customers and stakeholders can help to manage expectations and build trust. I recently spoke with a client who had experienced a security incident, and while the overall impact was not nearly that of the CrowdStrike outage, they conveyed that proactive communications with their clients resulted in easier conversations overall. The best way to determine and manage your incident response preparedness is to test your incident response plan. The test can vary in format, but should be realistic enough to identify gaps in your process so you’re not scrambling in the case of an actual incident.

Diverse Security Stack

Relying on a single vendor for critical security functions can increase risk. Diversifying your security stack can help to reduce the impact of potential failures. This means implementing multiple technologies, tools, and controls across key areas such as network security (firewalls, intrusion detection/prevention systems, virtual private networks, web application firewalls), endpoint security (antivirus, anti-malware, endpoint detection and response, encryption), identity and access management (multifactor authentication, single sign-on, access controls), cloud security (cloud access security broker, cloud workload protection platforms), and data security (data loss prevention, data encryption, data backup and recovery).

Robust Testing and Validation

Rigorous testing and validation processes are crucial for preventing software defects that can lead to widespread outages. Your first layer of defense when making a change to your system is a change management process and software development life cycle (SDLC) that requires changes to be sufficiently tested, validated, and approved before they are implemented. I have worked with organizations that specifically analyze incidents that were caused by a change, as this can inform whether your test plans and validations are sufficient.
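One way to approach the change-related incident analysis mentioned above is to measure what share of incidents trace back to a change and which testing gaps recur. This is a minimal sketch; the incident records, field names, and gap labels are made-up examples, not data from any real organization.

```python
from collections import Counter

# Hypothetical incident log; in practice this would come from a ticketing system.
incidents = [
    {"id": "INC-1", "caused_by_change": True, "test_gap": "no staged rollout"},
    {"id": "INC-2", "caused_by_change": False, "test_gap": None},
    {"id": "INC-3", "caused_by_change": True, "test_gap": "missing regression test"},
    {"id": "INC-4", "caused_by_change": True, "test_gap": "no staged rollout"},
]

# Keep only the incidents that were traced back to a change.
change_incidents = [i for i in incidents if i["caused_by_change"]]
change_share = len(change_incidents) / len(incidents)

# Count how often each testing gap shows up among change-related incidents.
gap_counts = Counter(i["test_gap"] for i in change_incidents)

print(f"{change_share:.0%} of incidents were caused by a change")
for gap, count in gap_counts.most_common():
    print(f"{count} x {gap}")
```

A recurring gap near the top of the list suggests where the test plan or approval step may need strengthening.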

Disaster Recovery Planning

A recent trend I have seen among clients that are now hosted in the cloud is pointing to their hosting provider as their disaster recovery strategy. But what happens when your critical vendors, including your hosting providers, experience outages and their disaster recovery plans fail? What if their recovery objectives do not align with yours? Organizations should implement their own disaster recovery plan that accounts for such dependencies, minimizes downtime, and supports business continuity in line with the company’s objectives and requirements. A disaster recovery strategy should also include regular testing and updating of the plan.
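To make the question of misaligned recovery objectives concrete, the sketch below compares a vendor's committed recovery time objective (RTO) and recovery point objective (RPO) against your own requirement. The class, the `find_gaps` helper, and every figure are hypothetical; real values would come from contracts and a business impact analysis.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjective:
    """Recovery targets in hours. All names and values here are hypothetical."""
    name: str
    rto_hours: float  # recovery time objective: maximum tolerable downtime
    rpo_hours: float  # recovery point objective: maximum tolerable data loss


def find_gaps(required, vendor_objectives):
    """Return a finding for each vendor commitment looser than our requirement."""
    findings = []
    for vendor in vendor_objectives:
        if vendor.rto_hours > required.rto_hours:
            findings.append(
                f"{vendor.name}: RTO {vendor.rto_hours}h exceeds required {required.rto_hours}h"
            )
        if vendor.rpo_hours > required.rpo_hours:
            findings.append(
                f"{vendor.name}: RPO {vendor.rpo_hours}h exceeds required {required.rpo_hours}h"
            )
    return findings


# Illustrative values only.
required = RecoveryObjective("payments service", rto_hours=4, rpo_hours=1)
vendor_objectives = [
    RecoveryObjective("hosting provider", rto_hours=8, rpo_hours=1),
    RecoveryObjective("database vendor", rto_hours=2, rpo_hours=0.5),
]

for finding in find_gaps(required, vendor_objectives):
    print(finding)
```

Any finding here means the vendor's plan alone cannot meet your objective, so your own plan has to cover the difference.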

Employee Training

Employees should be trained on how to respond to IT outages and have access to clear communication channels. This likely includes engaging them in your incident response and disaster recovery tests. It also may mean requiring targeted training on key cybersecurity concepts, secure coding requirements, and your change management/SDLC policies.

How to Build a More Resilient Organization

Building on the lessons learned, there are some key takeaways specific to building resilience in your organization that should be considered:

Strengthen Third-Party Risk Management: Conduct thorough assessments of third-party vendors and their security practices.
Invest in Cybersecurity Training: Empower employees to recognize and report potential security threats.
Adopt a Zero-Trust Security Model: Implement a layered security approach that verifies and continuously validates access to systems and data.
Embrace Automation: Automate security processes to improve efficiency and reduce human error.
Regularly Test and Update Systems: Conduct ongoing testing and updates to identify and address vulnerabilities.

The CrowdStrike outage serves as a valuable wake-up call for organizations to prioritize cybersecurity and resilience. By learning from this incident and implementing the necessary measures, businesses can better protect themselves from future disruptions.

Leveraging Operational Risk Management for Your Organization

I’ve previously discussed the concept of Operational Risk Management, and these lessons only reinforce the importance of implementing a risk management program that spans your organization. If you were to assign ownership for each of the process and control areas highlighted in the previous sections, you would see that accountability does not just reside with the security officer, or with IT (though much of it falls to them). Incidents like this are a good reminder that it takes everyone in an organization (IT, security, procurement/vendor management, operations, human resources, and beyond) to build a culture of risk management and internal control.

How SOC 2 & Other IT Compliance Frameworks Can Help

All of the concepts I’ve discussed above are inherent in the SOC 2 framework and other leading IT compliance frameworks. In a previous blog, I described some of the leading IT compliance frameworks and how a good framework should inform your controls and policies. Choosing the right IT compliance framework can support your organization in implementing controls around risk management, third-party risk, incident response, training and awareness, etc. that in turn support your organization’s objectives and compliance requirements. I’ve also found in my years of working with clients that being forced to be compliant with an IT framework brought structure and discipline throughout the organization, beyond just the control requirements. I think that’s a testament to these frameworks and how they can (and should) be a strategic investment in an organization’s long-term success.

In Summary: How to Make Good of a Bad Situation

IT outages and security incidents are never fun to experience. Stress levels are high during and after, and business may be lost as a result. However, incidents like the CrowdStrike outage can serve as a good reminder to check in on your internal controls and cybersecurity program and continue to improve the mechanisms you have in place. Technology is always evolving, which means the risks associated with technology are always changing, so being forced to re-evaluate risk in light of a bad situation can only bring a good outcome. If you would like more information on how Linford & Company can help, please do not hesitate to reach out.

About The Author

Maggie Cheney

Maggie has over 15 years of experience in Risk Management and IT Compliance. She spent nearly 10 years in KPMG’s IT Advisory and Attestation practice before joining a financial technology company as the Risk and Compliance Director. She has overseen numerous SOC 1 / SOC 2 audits and other IT Compliance audits and has vast experience implementing risk management and IT compliance solutions. She is Certified in Risk and Information Systems Control (CRISC) and obtained a Bachelor of Science in Business Administration, Finance, from the University of Colorado at Boulder.
