Failure is inevitable (or is it?)

A very thought provoking paper on why complex systems fail
– How Complex Systems Fail by Richard Cook

This was referenced by John Allspaw on his blog (http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/)

One of the most interesting points for me is number ten “All practitioner actions are gambles”. Whenever we do (or don’t do) something – upgrade a package, reboot a server, restart a service there’s a risk that it’s not going to end happily. We can (often) minimise the risk by trying the operation in a test environment first and mitigate the consequences of failure by having a backout plan (and backups :->) but sometimes our experience tells us that we should “just do it” and it will be fine. Most of the time it is but sometimes it isn’t. This leads into point seven “post accident attribution to a ‘root cause’ is fundamentally wrong”. Yes there will be a trigger whether it’s a dodgy disk controller or the (apparently) unrelated package updated last month but the _real_ problem is that the odds are against us.

When running production services we have to balance the costs of testing, scheduled downtime, redundancy against the probability of failure (and the cost of resulting unscheduled downtime). Because of these costs we may (and do) run systems with known issues. We obviously can’t do this if the problem has a direct impact on service but if it’s a failure which is masked by redundancy (say a single fibre path) and the costs and risks involved in investigating the failure and bringing the system back to perfection are judged to be too high then we might decide to leave well alone. This doesn’t make me happy but I’ve got a clearer framework to think about it now.

This ties in with “Better”, a book by Atul Gawande (http://www.gawande.com/) that I’ve just read. He’s talking about how surgeons in particular (he’s a surgeon and doctors in general make decisions and how they try to improve. A big part of this is the need for measurement and reflection – (a) collect data and (b) think about it.

Infrastructure Systems Group Blog

Windows and Linux Infrastructure at Newcastle University

Leave a Reply Cancel reply