Follow @NU_ITtech on Twitter

Twitter users may be glad to know that we now have an official Twitter account for the technical teams in ISS – @NU_ITtech. As with any new account, it’s going to take a while to get up to speed, but you can expect the sort of content that you get from this blog (only in 140 characters or less), along with other short-form content, including any links that we think are worth checking out. This feed is squarely aimed at technical staff, although some content may be interesting to a wider audience.

This isn’t going to be a one-way channel, so please treat it as a conversation (bearing in mind that it’s completely public, and there are some things that shouldn’t be discussed publicly for security reasons).

There is also a more user-focused account for information and tips about ISS services and IT in general at the University – @NU_ITservice.

Pass phrases

Pass phrases are more memorable and more secure than passwords. I suggest you use them, although not “correct horse battery staple”. Use CAPITALS, punctuation & numbers, but not in place of letters – you aren’t as 1337 as you think you are 😉
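For the scripting-inclined, here’s a rough illustration of the idea – a few random dictionary words joined up, with a capital, some punctuation and a digit added on rather than substituted in. The word list path and word count are placeholders, not a recommendation.

    # Toy passphrase generator - illustrative only, not an official tool.
    # Assumes a standard dictionary file; adjust the path and word count to taste.
    import secrets

    WORDLIST = "/usr/share/dict/words"   # placeholder path
    NUM_WORDS = 4

    with open(WORDLIST) as f:
        words = [w.strip() for w in f
                 if w.strip().isalpha() and 4 <= len(w.strip()) <= 8]

    chosen = [secrets.choice(words) for _ in range(NUM_WORDS)]

    # Add CAPITALS, punctuation and a number *alongside* the words,
    # rather than swapping digits in for letters.
    idx = secrets.randbelow(NUM_WORDS)
    chosen[idx] = chosen[idx].upper()

    print("-".join(chosen) + "!" + str(secrets.randbelow(10)))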

Failure is inevitable (or is it?)

A very thought-provoking paper on why complex systems fail:
How Complex Systems Fail by Richard Cook

This was referenced by John Allspaw on his blog (http://www.kitchensoap.com/2009/11/12/how-complex-systems-fail-a-webops-perspective/)

One of the most interesting points for me is number ten: “All practitioner actions are gambles”. Whenever we do (or don’t do) something – upgrade a package, reboot a server, restart a service – there’s a risk that it’s not going to end happily. We can (often) minimise the risk by trying the operation in a test environment first, and mitigate the consequences of failure by having a backout plan (and backups :->), but sometimes our experience tells us that we should “just do it” and it will be fine. Most of the time it is, but sometimes it isn’t. This leads into point seven: “post accident attribution to a ‘root cause’ is fundamentally wrong”. Yes, there will be a trigger – whether it’s a dodgy disk controller or the (apparently) unrelated package updated last month – but the _real_ problem is that the odds are against us.

When running production services we have to balance the costs of testing, scheduled downtime and redundancy against the probability of failure (and the cost of the resulting unscheduled downtime). Because of these costs we may (and do) run systems with known issues. We obviously can’t do this if the problem has a direct impact on service, but if it’s a failure which is masked by redundancy (say a single fibre path), and the costs and risks involved in investigating the failure and bringing the system back to perfection are judged to be too high, then we might decide to leave well alone. This doesn’t make me happy, but I’ve got a clearer framework to think about it now.

This ties in with “Better”, a book by Atul Gawande (http://www.gawande.com/) that I’ve just read. He talks about how surgeons in particular (he’s a surgeon) and doctors in general make decisions and how they try to improve. A big part of this is the need for measurement and reflection – (a) collect data and (b) think about it.

Asset registers

I’ve been musing recently (actually for ages but the issues have only recently crystallised) about asset registers and/or inventories.
I’m talking about servers here rather than desktops – I’m sure the issues overlap but there are differences both in terms of scale and variety.

We need to have an up-to-date asset register with information about where things are, how much they cost and who paid for them (both because the University says we must and because it’s the right thing to do).

Since this information doesn’t obviously help us run services it gets viewed as an admin overhead and tends to be updated infrequently (usually just before the annual deadline).

My feeling is that the best way to get this info kept up-to-date is to use it for operational things – in my ideal world you would make config changes to a single database (which would apply policies to fill in defaults and highlight inappropriate settings) and the system would then generate the correct entries in all the other databases (from the config files we use for automated installs with jumpstart and kickstart to the entries in the Nagios and Munin monitoring systems).
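As a rough sketch of what I mean (the table layout, field names and template here are invented for illustration, not our actual schema), you could imagine generating Nagios host definitions straight from the central database:

    # Sketch: generate Nagios host definitions from a central inventory database.
    # The schema (a "hosts" table with name, ip, os and rack columns), the file
    # names and the template are all assumptions for illustration.
    import sqlite3

    NAGIOS_HOST_TEMPLATE = """define host {{
        use        generic-host
        host_name  {name}
        address    {ip}
        notes      rack {rack}, {os}
    }}
    """

    conn = sqlite3.connect("inventory.db")   # stand-in for the real config database
    rows = conn.execute("SELECT name, ip, os, rack FROM hosts")

    with open("hosts.cfg", "w") as out:
        for name, ip, os_version, rack in rows:
            out.write(NAGIOS_HOST_TEMPLATE.format(
                name=name, ip=ip, os=os_version, rack=rack))

The same loop could just as easily spit out jumpstart/kickstart snippets or Munin node lists – the point is that the database is the one place you edit.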

We need to hold two types of information about systems. First the ‘financial’ data (cost, date of acquisition etc) and then the ‘system’ data (services provided, rack location, switch port, RAM installed, OS version etc).
Most (all) of the first set of data is fixed when the box arrives here and won’t change over time. Capturing this generally involves a human (gathering info from the purchase order and the physical items, and sticking an asset tag on the box) and should be part of our goods inwards process.
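To make the distinction concrete, here’s how the two kinds of record might look (field names are illustrative, not a finished schema):

    # Sketch of the two kinds of record: 'financial' data is fixed once the box
    # arrives; 'system' data changes over the machine's life. Field names are
    # examples only.
    from dataclasses import dataclass, field
    from datetime import date

    @dataclass(frozen=True)        # captured once at goods inwards, then immutable
    class FinancialRecord:
        asset_tag: str
        cost: float
        acquired: date
        funded_by: str
        purchase_order: str

    @dataclass                     # expected to change; ideally maintained automatically
    class SystemRecord:
        asset_tag: str
        hostname: str
        services: list = field(default_factory=list)
        rack_location: str = ""
        switch_port: str = ""
        ram_mb: int = 0
        os_version: str = ""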

Much of the second set of data will change over time and should be maintained automatically (OS version, RAM, network interfaces) – it makes much more sense for the computers to keep this up-to-date. Stuff like which packages are installed and which services are running should be controlled by a configuration management system like cfengine. The CM system and the inventory need to be linked, but I don’t think they’re the same thing.

There’s a set of data which (at least in our environment) changes less frequently. An example is location; most servers arrive, get racked and sit in the same rack location until they’re retired. We occasionally move existing servers, both between server rooms (to balance up site resilience) and between racks within a room (perhaps if we’ve retired most things in a rack and want to clear it out completely). This process obviously involves a human, and part of the process should be updating records to show the new location. I’m keen to back this up with a consistency check (to catch the times where we forget). It should be possible to use the MAC addresses seen on the network switches to find which servers are where (since there is a many-to-one mapping between network switches and rooms). Most of our server rooms have a set of racks in the middle with switches in, and servers are connected via structured cabling and patch panels, so this doesn’t help with moves within a room; however, we’re gradually moving towards having switches in the server racks.
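Here’s the sort of consistency check I have in mind, as a sketch. It assumes we can already get the set of MAC addresses seen on each switch (from SNMP or saved switch output) and a MAC and room for each server in the register – the data structures are made up for illustration:

    # Consistency-check sketch: compare where the switches think a server is
    # (by MAC address) with where the asset register says it is.
    # The register format and the source of the switch MAC tables are assumptions.

    def check_locations(register, switch_macs, switch_to_room):
        """register: {hostname: {"mac": ..., "room": ...}}
           switch_macs: {switch_name: set of MACs seen on that switch}
           switch_to_room: {switch_name: room}"""
        mismatches = []
        for host, info in register.items():
            for switch, macs in switch_macs.items():
                if info["mac"].lower() in macs:
                    actual_room = switch_to_room[switch]
                    if actual_room != info["room"]:
                        mismatches.append((host, info["room"], actual_room))
        return mismatches

    # Entirely made-up example data:
    register = {"web1": {"mac": "00:11:22:33:44:55", "room": "Server Room A"}}
    switch_macs = {"sw-b-01": {"00:11:22:33:44:55"}}
    switch_to_room = {"sw-b-01": "Server Room B"}
    print(check_locations(register, switch_macs, switch_to_room))
    # -> [('web1', 'Server Room A', 'Server Room B')]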

I’ve been looking for an open source system that will help us do this. Open source isn’t an absolute requirement, open interfaces are (because I want to use this information to drive other things). I know we could lash together a MySQL database and web frontend to hold details entered by hand. I’m sure we could extend this to import info from system config reports generated on the servers themselves and sent back to the central server. The thing that stops me doing this is the feeling that someone out there must already have done this.
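For what it’s worth, the sort of report I’d want coming off each server is pretty minimal – a short script calling standard OS tools and printing plain key=value output that the server end can import however it likes. This is only a sketch; the commands shown are Linux placeholders and the Solaris equivalents (prtconf, ifconfig -a and friends) would differ:

    # Sketch of a minimal inventory agent: standard OS tools, no extra libraries,
    # plain key=value output. Commands are Linux placeholders; Solaris would use
    # prtconf, ifconfig -a, etc.
    import socket, subprocess

    def run(cmd):
        try:
            return subprocess.check_output(cmd, shell=True, text=True).strip()
        except subprocess.CalledProcessError:
            return "unknown"

    report = {
        "hostname": socket.gethostname(),
        "os": run("uname -sr"),
        "memory": run("grep MemTotal /proc/meminfo"),
        "interfaces": run("ip -o link"),
    }

    for key, value in report.items():
        print(key + "=" + value.replace("\n", " | "))

    # In real use this output would be sent back to the central server,
    # e.g. by scp or a simple HTTP POST, and imported into the database there.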

I recently came across the slide deck from Jordan Schwartz’s 2006 presentation on Open Source Server Inventories:

http://www.uuasc.org/server-inventory.pdf

The slides referenced the Data Center Markup Language site (http://www.dcml.org/), which has some interesting ideas about representing information about systems in a portable way. DCML seems to have gone quiet though.

They also referenced the Large Scale System Configuration group’s pages –
http://homepages.inf.ed.ac.uk/group/lssconf/iWeb/lssconf/LSSConf.html
There are lots of interesting thoughts there about how large systems could/should be managed (but nothing I could spot to solve my immediate problem).

I installed a number of asset tracking systems. None (so far) has really clicked with me. It’s quite possible that I’ve missed the point with some of them, but here’s my quick take.

Asset Tracker for RT
http://code.google.com/p/asset-tracker-4rt/
We don’t use RT (we use Peregrine’s ServiceCenter) so integration with RT doesn’t win us anything. As far as I can see this relies on manually entered data (though I’m sure it would be possible to automate some population of the asset database).

OCS Inventory NG
http://www.ocsinventory-ng.org/
I quite liked this one. Agents installed on the clients generate XML files which are sent back to the server. My main objection was the large set of prerequisites for the agents which made deployment painful. My ideal agent would be a very lightweight script with very few dependencies which used standard OS tools to gather info and then sent the info back to the server as simply as possible.

Racktables
http://racktables.org/
This one definitely looks interesting (though perhaps not for this immediate problem); from a brief skim of the wiki it would be useful for getting a view of rack capacity and dependencies for planning etc. Some comments on the mailing list imply that its primary purpose isn’t an inventory system. There’s no obvious way of doing bulk imports (but from a look at the database it wouldn’t be impossible).

RackMonkey
http://sourceforge.net/projects/rackmonkey/
Simpler version of Racktables? No obvious way of doing bulk imports.

Open-AudIT
http://www.open-audit.org/
This seems aimed more at auditing desktops (it makes a big play of the ability to get the serial number of the monitor, which is undoubtedly useful if you’ve got hundreds of them, but all our servers are headless). I like the model of a simple text file generated on the client which is then imported by the server. We would need to produce a Solaris version of the agent.

In the longer term I expect that we’ll want to populate the asset database in Peregrine so that we can have better integration with ServiceCenter. I’m sure that’s the right thing to do, but I suspect that the Peregrine asset database will end up being an automatic (subset) replica of the main database (because there’s some stuff that will be best kept separately).