Chaos Engineering: What Is It? Should You Use It?

Incident response plans (IRPs) are an essential part of dealing with security incidents. Traditionally, an organization tests its IRP by creating a scenario on paper and walking key members of the incident response team through it to determine whether everyone understands what to do when an incident occurs. Chaos engineering offers a fresh perspective on this traditional security practice.

In this post, we will review what an IRP looks like today and then compare that to how chaos engineering changes up the norm. Additionally, we will discuss how it works, whether it is risky, who uses it, and why it may be a good idea for your company.

 

What is Chaos Engineering vs. Incident Response?

The main goal of an IRP is to create a plan of action in the event that a security incident occurs. Security incidents include, but are not limited to, phishing attempts, SQL injection attacks, cross-site scripting, malware attacks, and denial-of-service attacks.

The reasoning behind an incident response plan is that the organization defines a sequence of actions and identifies key personnel in advance, so that it knows when to escalate a security incident and each person understands their role in dealing with it.

At least annually, the organization creates a possible scenario and talks through how the IRP would be executed. While this approach is still valid, there is often still chaos within the organization when an actual security event occurs. This is because testing of an IRP is controlled: everyone knows every aspect of the scenario, and even if they do not, there is no real risk of data loss or of the business being unavailable. Chaos engineering changes that.

The definition of chaos engineering is “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Chaos engineering emerged as systems came to comprise more and more services and microservices, whose interactions can cause unforeseen disruptions. These disruptions are compounded by real-world variables such as different types of attacks.

In chaos engineering, the unknown interactions between services are exploited on purpose, so that the engineering team must execute the IRP in response to an actual failure in the system environment.
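To make the point about service interactions concrete, here is a minimal sketch of how a single failure can ripple through a system. The service names and the dependency map are hypothetical and chosen only for illustration; a real environment would need its dependencies discovered or documented first.

```python
from collections import deque

# Hypothetical service graph: each key depends on the services in its list.
DEPENDS_ON = {
    "web": ["auth", "catalog"],
    "auth": ["user-db"],
    "catalog": ["search", "user-db"],
    "search": [],
    "user-db": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so we can look up who calls a given service.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk from the failed service up through its callers.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents[current]:
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by("user-db", DEPENDS_ON)))
```

In this toy graph, losing the database affects every service that reaches it, even indirectly. In a real system the map is rarely this complete, which is the gap chaos experiments are meant to expose.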

 

How Does Chaos Engineering Work?

Implementing chaos engineering requires an organization to be open to the possibility of exploiting unknown weaknesses within the system. Each possible exploitation can be thought of as an experiment: if I do this to the system, or introduce this, what will happen? There are four basic steps that an organization can use for each experiment.

  1. Define the “normal state” of the system or component of the system.
  2. Create an educated guess as to possible events that could occur when you introduce variables to try to destabilize the “normal state” of the system.
  3. Test the theory by introducing variables. Variables can include a server crash, disruption of the network connection, hard drive issues, etc.
  4. Complete the steps needed to get the system back to its “normal state” and document those steps. This can be tracked using issue and project tracking software.

By implementing these steps, a company can gain more assurance about what it considers to be the normal state. Additionally, it should allow the organization to become more efficient at understanding and recovering from incidents.
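The four steps can be sketched in code against a toy system. Everything here is an assumption made for illustration: `SimulatedSystem`, `run_experiment`, and the server counts are hypothetical names and values, and the "fault" is a simple in-memory counter change rather than a real outage.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedSystem:
    """Toy stand-in for a real environment (hypothetical, for illustration)."""
    healthy_servers: int = 3
    log: list = field(default_factory=list)

    def is_normal(self, minimum_servers):
        # "Normal state" here means enough healthy servers to serve traffic.
        return self.healthy_servers >= minimum_servers

    def crash_server(self):
        self.healthy_servers = max(0, self.healthy_servers - 1)
        self.log.append("injected: server crash")

    def restore_server(self):
        self.healthy_servers += 1
        self.log.append("recovery: replacement server started")


def run_experiment(system, minimum_servers, expect_normal_after_fault):
    # Step 1: define and confirm the normal state before touching anything.
    if not system.is_normal(minimum_servers):
        raise RuntimeError("system is not in its normal state; aborting")
    baseline = system.healthy_servers

    # Step 2: the educated guess is supplied by the caller as
    # expect_normal_after_fault.
    # Step 3: introduce the variable (a single simulated server crash).
    system.crash_server()
    stayed_normal = system.is_normal(minimum_servers)

    # Step 4: return to the normal state; the log documents the steps taken.
    while system.healthy_servers < baseline:
        system.restore_server()

    return {
        "hypothesis_confirmed": stayed_normal == expect_normal_after_fault,
        "stayed_normal": stayed_normal,
        "steps": list(system.log),
    }


if __name__ == "__main__":
    result = run_experiment(SimulatedSystem(), minimum_servers=2,
                            expect_normal_after_fault=True)
    print(result)
```

The returned `steps` list plays the role of the documentation in the final step. In practice, that record would more likely live in an issue tracker than in a Python list.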

 

Is Chaos Engineering Risky?

Yes, chaos engineering is a risky approach. Any time an organization intentionally tries to destabilize the system, there is a risk that it could take longer than estimated or desired to get the system back to its normal state. As a result, it could impact the availability of the system.

Since this is a new approach, only a limited number of companies have used it, but it is gaining traction, as discussed in the next section. That said, a number of open source projects aim to implement chaos engineering in a less disruptive way.

 

Who Uses Chaos Engineering?

The best-known implementation of chaos engineering comes from Netflix, which created a tool called Chaos Monkey. According to its GitHub page, Chaos Monkey “randomly terminates virtual machine instances and containers that run inside of your production environment.”

Netflix built Chaos Monkey on the premise that exposing engineers to failures more frequently incentivizes them to build resilient services. Netflix learned the lesson that to avoid failing you must learn to fail constantly. From this lesson, Chaos Monkey was born. There are now many similar tools available for use.
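The idea of random termination can be illustrated with a small simulation. This is not Chaos Monkey's code or API: the function names, the `probability` parameter, and the instance names are all hypothetical, and the "fleet" is just a dictionary in memory.

```python
import random


def pick_victim(instances, rng, probability=1.0):
    """Pick one running instance to terminate, or None to skip this round."""
    running = [name for name, state in instances.items() if state == "running"]
    # Skip when nothing is running or the random draw falls outside the
    # configured probability.
    if not running or rng.random() >= probability:
        return None
    return rng.choice(running)


def terminate_random_instance(instances, rng, probability=1.0):
    """Mark one randomly chosen running instance as terminated (simulation only)."""
    victim = pick_victim(instances, rng, probability)
    if victim is not None:
        instances[victim] = "terminated"
    return victim


if __name__ == "__main__":
    # Hypothetical fleet; a real tool would query a cloud provider instead.
    fleet = {"api-1": "running", "api-2": "running", "api-3": "running"}
    victim = terminate_random_instance(fleet, random.Random())
    print("terminated:", victim)
    print("fleet state:", fleet)
```

A lower `probability` makes terminations less frequent. That is one simple way to limit how disruptive such an experiment is while a team is still building confidence.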

Other major adopters of chaos engineering include Google, Amazon, Microsoft, Dropbox, Yahoo!, Uber, SendGrid, and GitHub, among others.

Why Should My Organization Consider Chaos Engineering?

If your organization is willing to tolerate the risk that chaos engineering can introduce, this method may be the best way to gain a higher level of assurance that your system and engineering team are truly prepared to tolerate a security incident. With that said, it’s important to ensure that your network architecture is set up to support this type of method.

An organization that uses a distributed architecture composed of microservices is the type of organization that this method is geared toward. If this is not true for your organization, then this method is not the best approach.

Chaos Engineering Summarized

The number of major security incidents that result in a data breach or loss of system availability is still on the rise, and so is the price of the tools and services used to protect an organization’s environment. As systems become increasingly complex, it is becoming harder to predict how a security incident may affect your organization.

While we, as auditors, always suggest having some form of tested incident response plan, newer methods such as chaos engineering may be an effective way for an organization to gain a more stable environment overall and to better understand the weaknesses within that environment, without purchasing another tool that may or may not help.

If your company has any questions on creating an effective IRP as part of an audit such as SOC 2 or HIPAA, please reach out for more information. For further reading, see other Linford & Company posts related to this one.