Uptime is the performance measure customers and service users judge you on. But in today’s interconnected world, a good score is getting harder to achieve. We’ve moved on from systems that are monolithic, highly controlled, and only occasionally updated.
Today, software runs on multiple servers, relies on distributed networks, and is frequently updated. There are many more opportunities for things to go wrong, problems are much harder to find, and often the root cause lies outside your control entirely.
There really isn’t any room for maneuver, either. The difference between 99% and the gold standard of 99.9999% uptime is significant. It’s the difference between over 3.5 days of downtime in a year, which would be unacceptable to many people, and just over 30 seconds, which is potentially barely noticeable.
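The arithmetic behind those figures is straightforward. This short sketch (the helper name is our own, not from any library) converts an uptime percentage into downtime per year:

```python
# Convert an uptime percentage into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def downtime_seconds_per_year(uptime_percent: float) -> float:
    """Seconds of downtime per year permitted by a given uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)

print(downtime_seconds_per_year(99.0) / 86400)  # ~3.65 days
print(downtime_seconds_per_year(99.9999))       # ~31.5 seconds
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the jump from two nines to six is so dramatic.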
Testing goes some way toward finding and fixing problems. But by its very nature, testing only catches problems that are known or can be anticipated.
It doesn't cover every configuration, every error condition, or the many factors beyond your control, such as the failure of a third-party host server or a surge in usage.
It’s these problems that will really trip you up – and bring down your systems.
Of course, your customers or service users don’t care about the complexity of your systems or the load that’s being placed on them. All they see is a system that is unreliable. In many cases, downtime is frustrating and leads to a damaging loss of reputation. But in some sectors, such as aerospace, defense, or health, downtime could be literally life-threatening.
It is possible to fix problems, or at least find workarounds, on the fly when a problem happens. But fixing problems in a high-pressure situation carries risks, and with the clock ticking, every second counts.
Much better is to investigate and develop solutions for problems before they happen. To do this, many organizations are turning to a testing concept developed by engineers at Netflix. It’s called chaos engineering.
The method sees testers proactively run experiments, inject failures, and engineer disaster scenarios so that solutions can be developed thoroughly and calmly rather than in the heat of the moment.
The idea is to understand what happens when chaos ensues, not to cause chaos, so the experiments are very tightly controlled. To take that control further, it's best practice to use tools that bring structure to the testing and to automate tests for known failures, maximizing efficiency.
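In miniature, a controlled chaos experiment looks something like the sketch below: wrap a dependency so it fails at a known rate, then check that the system still honors its steady-state promise. All the names here (fetch_recommendations, fallback_recommendations, and so on) are hypothetical stand-ins, not part of any real service or tool:

```python
import random

def fetch_recommendations(user_id: int) -> list:
    """Stand-in for a call to a downstream service."""
    return ["item-%d-%d" % (user_id, i) for i in range(3)]

def fallback_recommendations() -> list:
    """Degraded but safe response used when the dependency fails."""
    return ["popular-item-1", "popular-item-2"]

def with_fault_injection(call, failure_rate: float):
    """Wrap a dependency call so it raises with the given probability."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")
        return call(*args, **kwargs)
    return wrapped

def get_recommendations(user_id: int, fetch) -> list:
    """System under test: must always return a non-empty response."""
    try:
        return fetch(user_id)
    except ConnectionError:
        return fallback_recommendations()

# Experiment: inject failures into the dependency, then verify the
# steady-state hypothesis -- every request still gets a response.
random.seed(42)  # fixed seed keeps the experiment repeatable
flaky_fetch = with_fault_injection(fetch_recommendations, failure_rate=0.3)
results = [get_recommendations(u, flaky_fetch) for u in range(100)]
assert all(len(r) > 0 for r in results)
print("steady-state hypothesis held under 30% injected failure")
```

The point is not the fallback itself but the discipline: a hypothesis about steady state, a deliberately injected failure, and an automated check, all run on your schedule rather than production's.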
In short, by harnessing what chaos engineering has to offer and embracing open, extensible, and easy-to-use performance and load testing tools, such as Eggplant Performance, testers can deliver more resilient, more reliable systems and a better ROI.