Just back (well last week – taken me a while to write up my notes) from the UCISA Infrastructure Group (UCISA-IG) event in Liverpool – “Service availability – is 24x7x365 really necessary?”. These notes are very rough but I’d rather get them out now while reasonably fresh.
This sort of event is always worthwhile not just because of the “formal” talks but also the chance to meet colleagues from other institutions and talk about common issues. Doing this face to face allows you to be a bit less discreet than you would be on a mailing list :->. Topics that came up in passing were account management systems (why does everyone seem to write their own?); how IT services are organised internally (by platform/by layer/at random) and the difference between working in a large IT service (where most people are specialists and much of what your colleagues do is a black (or at least grey) box) and a small organisation where the IT person is likely to do network/storage/desktop/servers/everything else (because there’s no-one else).
Whilst the event was interesting and useful I felt the title was a bit misleading – most of it was about DR and BC (Business Continuity) rather than whether universities need 24×7 services. My instincts are:
1. Not everything needs the same level of availability
2. If more services were designed to use asynchronous communication and message queues we wouldn’t have to have such a broad shutdown of services on the (hopefully rare) occasions that we need to shut down one of the fundamental systems. To construct a concrete example: if a member of the University needs to update their address, does it matter whether the database change happens instantaneously, or is it OK if the change is made within half a day? The important thing is that they should be able to submit their change whenever is convenient (and that they get some feedback when it’s complete). Moving to reliable loose coupling should reduce our need for everything running all the time (there’s a rough sketch of the idea after this list).
3. Some systems are intrinsically easy to make resilient. My favourite is mail relaying (not the complete mail service – just the pure relay). Because each transaction is independent and there’s a standard mechanism to distribute requests between servers (MX records) it’s easy – you just add more servers (though there was the problem with large MX sets and poorly configured remote systems – I think that hit us when we got to 10 entries in our MX list).
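To make point 2 a bit more concrete, here’s a minimal Python sketch of the address-change example. It’s purely illustrative – the function names and the in-memory queue are mine, and a real system would use a persistent queue or message broker – but it shows the shape: the front end returns immediately, and a worker applies the change whenever the back-end database is available.

```python
import queue
import threading
import time

# Stands in for a persistent message queue / broker in a real deployment.
address_changes = queue.Queue()

def submit_address_change(username, new_address):
    """Called by the (hypothetical) web front end -- returns as soon as the request is queued."""
    address_changes.put({"user": username, "address": new_address})
    return "Thanks - your change has been queued and will be applied shortly."

def worker():
    """Runs whenever the back-end database is up; drains the queue."""
    while True:
        change = address_changes.get()
        time.sleep(0.1)  # apply_to_database(change) would go here
        print(f"Applied address change for {change['user']}")
        address_changes.task_done()

threading.Thread(target=worker, daemon=True).start()

print(submit_address_change("abc123", "1 New Street, Liverpool"))
address_changes.join()  # wait for the worker to catch up before exiting
```

The point isn’t the code itself but the decoupling: the fundamental database can be down for maintenance without the “submit a change” service having to go down with it.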
The opening session was David Teed talking through the processes you would use to set your recovery targets for services. Not everything needs to be recovered in 4 hours – you work through a Business Impact Analysis leading to an ICT Recovery Statement (what you’ll recover, how long it will take and what workarounds the business will use to cope in the meantime). This leads to a list of resource requirements and allows you to manage customer expectations and cost-justify solutions.
The idea is that you then invest to match the requirements exposed by the BIA (not going overboard on making things over-resilient – though you may do more if it brings other benefits). All very sensible, and if we haven’t done something like this already we should.
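Purely for illustration, a few ICT Recovery Statement entries might look something like the sketch below – the services, targets and workarounds are invented, not taken from any real BIA:

```python
# Invented examples of BIA-derived recovery targets and interim workarounds.
recovery_statement = [
    # (service, recovery time objective, workaround while it's down)
    ("Mail relay",      "4 hours", "Mail queues on secondary MX"),
    ("Student records", "1 day",   "Paper forms, re-keyed later"),
    ("Research HPC",    "1 week",  "Jobs wait; no workaround needed"),
]

for service, rto, workaround in recovery_statement:
    print(f"{service:15} recover within {rto:8} -- meanwhile: {workaround}")
```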
Next was Adrian Ellison (LSE), talking about working from home (WFH) as an item in the DR/BC toolbox.
WFH is often a big part of the BC plan but there are (of course) issues. DR moved up the LSE agenda after 7/7.
Alternative accommodation on the larger campus might be a better solution (as it maintains the face-to-face contact which is otherwise lost). As part of planning,
allocate a suitable alternative location for each critical activity (making sensible assumptions about loss of (access to) buildings).
Reciprocal arrangements with neighbouring institutions may be a possibility.
Not everyone can work from home (and some can’t do all of their jobs) – specialised equipment/other people.
WFH isn’t sustainable for long.
To support WFH you need
– Resilient dual-path network with OOB access via 3rd party ISP (tested regularly)
– Robust DC strategy with resilience
– Likely that you’ll need to scale up remote access systems quickly. For Citrix etc. you’ll probably need extra licences
– Think about how you do remote support (LSE use LogMeIn Rescue)
– Separate VPN/remote access for IT staff?
Telephony – mobile networks may (will) become overloaded
Will need to divert key numbers to alternate locations (pre-arrange with supplier)
May be able to divert to external numbers (advanced IPT – “remote office”)
Remote learning – if lots of students are accessing rich content do we have bandwidth to cope (to halls?)
Information security is important but if you make things too difficult people will create their own workarounds which will be worse in terms of security.
Make clear that staff have personal responsibility for the security of data/systems under their control.
Managing people and keeping them motivated is all more difficult when remote – you need some face-to-face meetings (off-site)
Off-site working relies on trust
There was a talk from Oracle/Strathclyde about how the availability features of 11g can help with resilience. The idea of Automatic Storage Management (ASM), which (as I understood it) replicates data across multiple low-cost modular storage arrays, seems like a nice one. Anything that helps us move away from big, expensive boxes that sit in the middle of everything (and tend to be fussy eaters) is welcome.
Active Data Guard (ADG) – replication of data – you can use the replicated copy for read-only queries/BI etc. as well as a standby to fail over to when the primary site fails (so you’re getting some use out of the standby kit).
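As a rough illustration of the “use the standby for read-only work” idea, here’s a Python sketch using the cx_Oracle driver. The connection details, user and table names are all made up, and in practice the routing would more likely live in your application’s connection configuration than in code like this:

```python
import cx_Oracle  # assumes the Oracle client libraries are available

# Made-up connection details for illustration only.
PRIMARY_DSN = "db-primary.example.ac.uk/PROD"
STANDBY_DSN = "db-standby.example.ac.uk/PROD"  # ADG standby, open read-only

def get_connection(read_only=False):
    """Send read-only reporting/BI work to the standby, everything else to the primary."""
    dsn = STANDBY_DSN if read_only else PRIMARY_DSN
    return cx_Oracle.connect("report_user", "not-a-real-password", dsn)

# A reporting query can run against the standby without touching the primary.
conn = get_connection(read_only=True)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM student_records")
print(cur.fetchone()[0])
conn.close()
```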
Talk by Adrian Jane, University of Plymouth on how they use IPstor appliances to virtualise storage. These boxes sit between the real storage and the machines using the storage. This allows you to do mirroring, migration and similar without downtime and without changing the configuration on the clients. IPstor boxes are hardened Linux servers. They obviously need to be replicated (as all the storage traffic flows through them) and reasonably chunky (for the same reason) – Plymouth are using something like HP 585 G6: quad CPU (6 core), 32GB RAM, 4x 8Gb HBAs.
As well as the obvious advantages, there’s also the benefit of simpler client config – all the mirroring is done in the IPstor.
The last talk was Richard Smith, Sheffield Hallam University on how they use VMware. They’ve moved further with VMware than we have – over 200 VMs (though I guess if we count up all of our Xen and VMware guests and add on all the Solaris zones for SAP we’d get a similar number). They’re running higher numbers of guests per host than us (50 as a matter of course, up to 120). vMotion allowed them to migrate services to a new data centre with no downtime.
vSphere can now use HP’s iLO technology to power up extra servers to cope with peak loads (and I think to reset hardware that appears to be hung).
A nice feature was the use of template VMs for Terminal Services servers – this let SHU scale up their TS capacity very quickly to cope with the extra load when large numbers of people worked from home because of the bad weather at the start of the year.
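As a rough sketch of the template idea (not how SHU actually do it – they may well use PowerCLI or the vCenter client directly), cloning a new Terminal Services guest from a template with the pyVmomi Python bindings looks something like this. The vCenter host, credentials, template and cluster names are all invented:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Walk the inventory looking for an object of the given type and name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next((obj for obj in view.view if obj.name == name), None)
    finally:
        view.Destroy()

# Invented connection details; certificate checking skipped for the example only.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.ac.uk", user="admin", pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    template = find_by_name(content, vim.VirtualMachine, "ts-template")
    cluster = find_by_name(content, vim.ClusterComputeResource, "prod-cluster")

    # Clone the template into the cluster's resource pool and power it on.
    spec = vim.vm.CloneSpec(
        location=vim.vm.RelocateSpec(pool=cluster.resourcePool),
        powerOn=True,
    )
    task = template.CloneVM_Task(folder=template.parent, name="ts-extra-01", spec=spec)
    WaitForTask(task)
finally:
    Disconnect(si)
```

Because the template already carries the TS build, scaling up is just a matter of stamping out more clones until the peak passes.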