Grigor Khachatryan

Director of Engineering, Platform | Los Angeles, CA


Resilience Engineering — Don’t Be Afraid to Show Your Vulnerable Side!

Published February 10, 2021

Every software developer’s primary goal is to come up with a practical, intuitive, and robust product, a platform or service lots of people can use without any major issue. The problem is that what happens out there with real users is a lot more, well, chaotic than in the controlled environment developers initially work in. That’s why more and more devs have been using specialized techniques to test out their handiwork and ensure optimal reliability.

Resilience engineering is a practice within Site Reliability Engineering (SRE), closely related to Chaos Engineering. If you’re having trouble wrapping your head around all these terms, don’t worry, we’ll cover each aspect separately and then show you the real magic behind properly applied resilience engineering.

First, here’s a short history lesson.

Where Did SRE Come from?

SRE dates back almost two decades, to when a ragtag group of Google devs tried to find a way to improve the reliability of the company’s sites and keep them working smoothly as they grew. Needless to say, these guys were so effective that their techniques and strategies grew into an IT discipline of its own.

SRE is now an important part of modern DevOps and helps bridge the gap between the initial framework created by developers and the highly practical concerns of real-life system administration.

The Advent of Chaos Engineering

These days, it’s not easy to see an issue coming a mile away and address it in advance to keep a company’s cloud-based platform up and running. And with even just 10–20 minutes of downtime, large corporations stand to lose a lot of potential business, as well as brand equity. Enter the creatively destructive art of Chaos Engineering.

Think of it as handing the keys to your finely tuned sedan to a rally driver, who runs it through its paces on the track to see what breaks first when the car is pushed to its limits.

If you do this on your first batch of sedans, you can go back to the shop and tweak out all the little glitches and potential weak spots, ensuring that the cars run like greased lightning for miles without breaking down when you actually start driving people in them.

One of the earliest well-known examples of this approach was Netflix’s aptly named Chaos Monkey, back in 2010, followed by the Simian Army a year later. It was simple, but it got the job done: it simulated a server failure by shutting down instances at random.
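As a rough illustration of the "shut down an instance at random" idea, here is a minimal Python sketch. It is not Netflix’s implementation: the function names, the instance IDs, and the `terminate` callback are all hypothetical stand-ins for whatever your platform uses to stop a server.

```python
import random


def pick_victim(instances, rng=random):
    """Return one instance ID chosen uniformly at random, or None if the fleet is empty."""
    if not instances:
        return None
    return rng.choice(instances)


def terminate_random_instance(instances, terminate, rng=random):
    """Pick a random instance and hand it to the (hypothetical) terminate callback."""
    victim = pick_victim(instances, rng)
    if victim is not None:
        terminate(victim)  # in a real setup this would call your cloud provider
    return victim


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]  # made-up instance IDs
    terminate_random_instance(fleet, lambda name: print(f"terminating {name}"))
```

Passing the terminate action in as a callback keeps the random selection testable without touching any real infrastructure.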

The Practice of Resilience Engineering

Resilience Engineering is all about building systems that can adapt and automatically take the best course of action when common issues occur. Any inadequacies found through testing are ironed out before the system can become truly resilient.

The Flow of a Basic Chaos Experiment

There are several basic steps to testing out the system’s vulnerabilities:

  • Define a baseline measurement for the system when things are running smoothly.
  • Come up with an idea of a potential failure.
  • Test for said failure on a small enough scale so as not to disrupt the whole system but still get measurable data you can act on.
  • Compare the results with the baseline performance and note any issues that have popped up.
  • Scale up your tests if no issues were found initially.
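The steps above can be sketched as a small loop. This is only an illustration under stated assumptions: the metric is something where lower is better (latency, for example), `measure` and `inject_failure` are hypothetical callbacks you would supply, and the blast radii and 10% tolerance are arbitrary example values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ExperimentResult:
    blast_radius: float      # fraction of the system exposed to the failure
    baseline: float          # steady-state metric before any failure
    observed: float          # metric while the failure was active
    within_tolerance: bool   # True if the system held up at this scale


def run_experiment(measure, inject_failure,
                   blast_radii=(0.01, 0.05, 0.25), tolerance=0.10, samples=5):
    """Measure a baseline, then inject a failure at growing scale until it hurts."""
    # Step 1: baseline while things are running smoothly.
    baseline = mean(measure() for _ in range(samples))
    results = []
    # Steps 2-5: start small, compare with the baseline, scale up only if healthy.
    for radius in blast_radii:
        restore = inject_failure(radius)  # hypothetical: returns an undo function
        try:
            observed = mean(measure() for _ in range(samples))
        finally:
            restore()  # always roll the failure back, even if measuring fails
        ok = observed <= baseline * (1 + tolerance)
        results.append(ExperimentResult(radius, baseline, observed, ok))
        if not ok:
            break  # an issue was found: stop and fix it before going bigger
    return results
```

Stopping at the first radius that exceeds the tolerance keeps the experiment from disrupting the whole system while still producing data you can act on.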

What happens with the system is often not the same as what the developers hypothesized would happen, making it an excellent learning opportunity.

The most common issues distributed systems are tested for are server failures: a single server simply not responding, a network that works only intermittently, or an entire group of servers going down.

The Ideal System Response to Shoot For

A resilient and well-put-together system will have a quick answer for the most common issues outlined above. For instance, if the cloud provider can no longer provide access to a given CPU, for whatever reason, the system should respond by switching to the next best available one.

Also, if the entire network of servers in a particular region goes out, the system should look for servers in another region. If the number of users hits a peak in a very short time, the system should naturally scale up and start using more servers to compensate.
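The "look for servers in another region" behavior can be sketched as a simple preference-ordered lookup. The region names and the health map below are made up for the example; in practice the health information would come from your own monitoring or load balancer.

```python
def pick_region(preferred_regions, health):
    """Return the first healthy region in preference order, or None if all are down.

    `health` maps a region name to True (healthy) or False (down); regions
    missing from the map are treated as down.
    """
    for region in preferred_regions:
        if health.get(region, False):
            return region
    return None


if __name__ == "__main__":
    regions = ["us-west", "us-east", "eu-central"]  # hypothetical names
    status = {"us-west": False, "us-east": True, "eu-central": True}
    print(pick_region(regions, status))  # prints us-east
```

Treating unknown regions as down is a deliberately cautious choice: the sketch would rather skip a region than route traffic somewhere it has no health data for.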

New Technologies Allow for Scalability and Automation, Cutting Down on Human Interventions

With the advent of Kubernetes, continuous delivery has been made much easier. The system’s response can be automated, so the need for human intervention goes down dramatically, and the whole system experiences less downtime as a result. The ability to quickly scale up during sudden bursts of user traffic is incredibly important for cloud-based services.
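To make the automated scale-up concrete, here is a small sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the min/max defaults, and the sample numbers are assumptions for illustration, and the real autoscaler applies extra rules (such as a tolerance band and stabilization windows) that are not modeled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to how far the metric is from its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    # Clamp to the configured bounds so a traffic spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Load is double the target, so the replica count doubles.
    print(desired_replicas(2, current_metric=200, target_metric=100))  # prints 4
```

Because the calculation is proportional, the same function scales the system back down once the burst of traffic has passed.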

Imagine giants like Netflix or Blizzard being unable to accommodate all the new users logging on, especially now that everyone is online throughout the day. Any amount of downtime would lead to hordes of unsatisfied customers ready to move on to other services that understand the power of continuous delivery.

Resilient Systems are Reliable and Competitive Systems

While we may have had a little bit of fun flexing our bad pun muscles in the title, the fact is that some companies really are afraid to look for vulnerabilities and address them early on. It’s tempting to start believing the myth of the flawless cloud environment, where all the servers work all the time, there’s no latency, and the number of users using your service barely fluctuates.

However, the reality is that if something can go wrong, it eventually will, and chances are you’re not going to be ready for it. If you play it smart and invest in resilience engineering, though, you’ll save yourself a whole lot of headaches down the line.