The continuous evolution of microservices and cloud architectures has left consumers and businesses dependent on complex systems whose failures are increasingly hard to predict. Distributed systems at scale can improve development flexibility and deployment velocity, but they are chaotic by nature. From the interactions between separate services within a system to real-world disruptive events, unpredictable outcomes can strike production environments at any time. It is therefore paramount to identify potential weaknesses before they affect customers.
Chaos engineering uses controlled experiments to let engineers evaluate how their systems respond to failure.
Chaos engineering is the discipline of testing a system’s response to failure conditions in production. By purposefully and proactively “breaking things,” engineers can compare what they expect to happen with what actually happens when parts of their systems fail. With chaos engineering, weaknesses are identified and fixed before they lead to outages.
When Netflix suffered its infamous three-day database corruption, during which it could not ship DVDs to its customers, the company decided it was time to move from a vertically scaled software stack to a distributed cloud architecture. However, moving to a horizontally distributed system encompassing hundreds of microservices introduced a new level of complexity.
The elaborate interconnectedness of the cloud architecture called for more reliable systems. As a result, Netflix engineers needed a tool to test how their system behaved under abnormal conditions by deliberately and proactively causing failures. By observing the system’s behavior, they could learn how to adjust it so that the remaining services would withstand unforeseen crashes. Thus, in 2010, Netflix created Chaos Monkey, a tool designed to intentionally terminate instances and services to test system stability.
As stated on the Netflix Tech Blog, the company “created Chaos Monkey to randomly choose servers in our production environment and turn them off during business hours. Some people thought this was crazy, but we couldn’t depend on the infrequent occurrence to impact behavior. Knowing that this would happen on a frequent basis created strong alignment among our engineers to build in the redundancy and automation to survive this type of incident without any impact to the millions of Netflix members around the world.”
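The idea described in that quote can be illustrated with a small simulation. The sketch below is not Netflix's Chaos Monkey and touches no real servers: the `Instance` class, the `terminate_random_instance` function, the 9-to-17 weekday window, and the "at least one instance running" availability rule are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Instance:
    """Hypothetical stand-in for one server in a service group."""
    name: str
    running: bool = True


def is_business_hours(now, start_hour=9, end_hour=17):
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive).

    The window is an assumption for this sketch, not a documented setting.
    """
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def terminate_random_instance(instances, now, rng):
    """Stop one randomly chosen running instance during business hours.

    Returns the stopped instance, or None if nothing was terminated.
    """
    if not is_business_hours(now):
        return None
    candidates = [inst for inst in instances if inst.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # simulated termination only
    return victim


def service_available(instances):
    """Simplified rule: the service survives if any instance is running."""
    return any(inst.running for inst in instances)


if __name__ == "__main__":
    fleet = [Instance(f"api-{n}") for n in range(3)]
    rng = random.Random(42)  # seeded so repeated runs pick the same victim
    tuesday_morning = datetime(2024, 1, 9, 10, 30)
    victim = terminate_random_instance(fleet, tuesday_morning, rng)
    print("terminated:", victim.name if victim else None)
    print("service still available:", service_available(fleet))
```

Restricting terminations to business hours mirrors the reasoning in the quote: failures happen while engineers are present to observe and respond, so redundancy gaps surface under supervision rather than overnight.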
Following the success of Chaos Monkey, the engineering team at Netflix developed a series of additional tools, called the Simian Army, that cause other types of failure, thus laying the groundwork for the practice of Chaos Engineering as we know it today.
Chaos Engineering has evolved from a single fault-inducing tool into a discipline designed to address the uncertainty inherent in the complexity of cloud architecture and microservices. Through controlled experiments, Chaos Engineering allows engineers to simulate real-world disruptive events and find potential weaknesses before they cause real incidents.
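A controlled experiment of this kind can be sketched as a short loop: check that the system is in a steady state, inject a fault, measure again, and always roll the fault back. The example below is a toy simulation under stated assumptions. `ToyService`, `run_experiment`, the round-robin routing without retries, and the 0.99 success threshold are hypothetical and do not come from any particular chaos tool.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical service with replicas behind a round-robin balancer."""
    replicas: int = 3
    failed: set = field(default_factory=set)

    def success_rate(self, requests=100):
        """Fraction of requests served when a request routed to a failed
        replica is simply lost (this toy balancer does not retry)."""
        served = sum(
            1 for r in range(requests) if (r % self.replicas) not in self.failed
        )
        return served / requests


def run_experiment(service, inject, rollback, threshold=0.99):
    """Run one experiment and report whether the hypothesis held.

    Hypothesis: the success rate stays at or above `threshold`
    even while the fault is active.
    """
    baseline = service.success_rate()
    if baseline < threshold:
        raise RuntimeError("system is not in a steady state; aborting")
    inject(service)
    try:
        observed = service.success_rate()
    finally:
        rollback(service)  # always undo the fault, even if measuring fails
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    service = ToyService()
    result = run_experiment(
        service,
        inject=lambda s: s.failed.add(0),        # take replica 0 down
        rollback=lambda s: s.failed.discard(0),  # bring it back
    )
    print(result)
```

In this toy setup the hypothesis does not hold, because requests sent to the failed replica are dropped instead of being retried elsewhere. That gap between expectation and observation is the kind of finding an experiment is meant to surface before a real outage does.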
Understanding how a system behaves when facing failure is beneficial on multiple levels: it keeps customers happy by catching issues before they disrupt customers' daily lives; it helps prevent substantial losses in revenue and maintenance costs; and it increases availability while reducing mean time to resolution.
Chaos Engineering was initially popular mainly among e-commerce companies, which lose revenue directly from downtime, and big tech organizations such as Google, Amazon, and Microsoft, which actively track the ROI of investing in reliability.
However, as technology advances and digital transformation accelerates, an increasing number of essential services (from remote work to grocery shopping to working out from home) rely on complex cloud architectures. Furthermore, customers in practically all industries expect and demand uninterrupted, reliable services, which underscores the importance of Chaos Engineering as a business strategy.
According to the DevOps and Cloud InfoQ Trends Report, Chaos Engineering has reached the Early Adopters stage: “We believe that the topic of chaos engineering has moved into early adopter, largely due to the increased promotion by the Netflix team and the O’Reilly Chaos Engineering book authors, and tooling such as the Chaos Toolkit and Gremlin’s as-a-service offerings.”
Former Netflix Senior Software Engineer Kolton Andrus states that he’s happy to see Chaos Engineering evolve beyond “the simple, blunt tooling of ‘Chaos Monkey’ killing random servers,” preferring “the more scientific and thoughtful approaches of next-generation tooling.” He adds: “As an industry, we need to continue pushing chaos engineering forward, making it simple and effective for engineers to thoughtfully test and experiment against their systems to better understand how they will behave under stress.”
As more businesses acknowledge the strategic role of digital transformation in maintaining competitiveness and delivering value to customers through continuous and reliable services, the adoption of Chaos Engineering will most likely grow rapidly. Because effective digital transformation implies a clear focus on reliability, taking a proactive approach to building more resilient systems becomes key to achieving success.
Having grown from a tool designed to induce random instance failures in a production environment into an array of tools and principles, Chaos Engineering is now widely embraced by the tech community.
As the world relies more and more on complex cloud infrastructures, the ability to identify potential issues before they lead to total outages can benefit businesses in all industries. It is never too soon to start implementing Chaos Engineering. Even simply thinking about a system from a Chaos Engineering perspective creates the opportunity for significant improvements.