It’s not been a great few months for IT infrastructure here. We’ve had a run of problems with significant impact on the services we deliver. The problems have all been in foundation services, which means their effects have been wide-ranging.
This informal post is aimed at a technical audience. It’s written from an IT systems point of view because that’s my background. We’ve done lengthy internal reviews of all of these incidents from technical and incident-handling viewpoints and we’re working on (a) improving communications during major incidents and (b) making our IT infrastructure more robust.
Back in November and December we had a long-running problem with the performance and reliability of our main IT infrastructure. At a system/network level this appeared to be unreliable communications between servers and the network storage they use (the majority of our systems use iSCSI storage, so they are heavily dependent on the network). We (and our suppliers) spent weeks looking for the cause and going down several blind alleys which seemed very logical at the time.
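For context, on a Linux host using the standard open-iscsi tools, the health of the storage sessions and the underlying network interface can be checked along these lines (the interface name `eth0` is made up for illustration; the exact targets and output will vary by setup):

```shell
# List active iSCSI sessions (target name, portal, session state)
iscsiadm -m session

# More detail, including per-connection state and negotiated parameters
iscsiadm -m session -P 3

# Error counters for the interface carrying the storage traffic -
# rising rx/tx error counts were the kind of symptom we were hunting for
ip -s link show eth0
```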
The problems started after one of the network switches at our second data centre failed over from one controller to the stand-by controller. There were no indications of any problems with the new controller, so the theory was that something external had happened which caused the failover _and_ led to the performance problems. We kept considering the controller as a potential cause but discounted it since it reported as healthy.
After checking the obvious things (faulty network connections, “failing but not yet failed” disks) we sent a bundle of configs and stats to the vendor for them to investigate. They identified some issues with mismatched flow control on the network links. The theory was that the mismatch had been there since installation but only had significant impact as the systems got busier. We updated the config on both sides of the link and that seemed to give some improvement, but obviously didn’t fix the underlying problem. We went back to the vendor and continued investigations across all of the infrastructure but nothing showed up as a root cause.
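As an illustration of the server side of that change: on Linux, Ethernet flow control (802.3x pause frames) can be inspected and set with `ethtool`. The interface name here is hypothetical, and the switch side of the link would be configured through the vendor’s own CLI:

```shell
# Show the current pause-frame (flow control) settings for the NIC
ethtool -a eth0

# Enable both receive and transmit flow control so the NIC matches
# the switch port. A mismatch (one side sending pause frames, the
# other ignoring them) can drop or delay traffic under load - exactly
# the sort of problem that only appears as the systems get busier.
ethtool -A eth0 rx on tx on
```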
Shortly before the Christmas break we failed over from the (apparently working) controller card in the main network switch at our second data centre to the original one – this didn’t seem logical as it wasn’t reporting any errors but we were running out of other options. However (to our surprise and delight) this brought an immediate improvement in reliability and we scheduled replacement of the (assumed) faulty part. We all gave a heavy sigh of relief (this was the week before the University closed for the Christmas break) and mentally kicked ourselves for not trying this earlier (despite the fact that the controller had been reporting itself as perfectly healthy throughout).
At the end of January similar issues reappeared. Having learnt our lesson from last time, we failed over to the new controller very quickly. This didn’t have the hoped-for effect, but we convinced ourselves that things were recovering; in hindsight the improvement was only because it was late on Friday afternoon and the load was decreasing. On Saturday morning things were worse and the team reassembled to investigate.

This time we identified one of a pair of network links which was reporting errors. The pair of links were bonded together to provide higher bandwidth and a degree of resilience. We disabled the faulty component, leaving the link working at half the usual throughput (but still able to handle normal usage), and this fixed things (we thought). Services were stable for the rest of the week, but on Monday morning it was clear that there was still a problem. At this point we failed back to the original controller and things improved. Given that we were confident the controller itself wasn’t faulty (it had been replaced at the start of the month), the implication was that there was a problem with the switch itself, which is a much bigger problem (this is what one of these switches looks like). We’re now working with our suppliers to investigate and fix this with minimal impact on service to the University.
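For what it’s worth, on a Linux server the equivalent of taking one faulty member out of a bonded pair looks roughly like this (interface names are invented; our faulty link was on the switch/storage side, where the vendor CLI differs):

```shell
# Check the health of the bond and its member links - per-member
# link status and failure counts are reported here
cat /proc/net/bonding/bond0

# Take the faulty member out of service; the bond carries on over
# the remaining link at half the aggregate bandwidth
ip link set dev eth1 down
```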
In the last few weeks we’ve had problems with the campus network being overloaded by fallout from an academic practical exercise, a denial-of-service attack on the main web server, and thousands of repeated requests to external DNS servers causing the firewall to run out of resources – but those are stories for another day.