Office365 and SPAM filtering

We’ve started the process of migrating staff email to Office 365. We moved the IT Service this week – not totally without problems but that’s one reason we started on ourselves.

We’ve had some feedback that people are getting more spam since the move which surprises us. We’re using Office 365 in hybrid mode which means that all mail from outside the University comes through our on-site mail gateways (as it always has) before being delivered to the Office 365 servers.

The graph below shows stats for the last month – we’re still rejecting over 80% of the messages that arrive at the gateways. We know this isn’t catching everything but there’s a dramatic difference between an unfiltered mailbox and a filtered one.

SPAM filtering for the last month

Infrastructure issues (part 2)

Back in March we had performance issues with our firewalls. One of the things that our vendor raised was what they saw as an unusually high number of DNS queries to external servers. We were seeing around 2,000-3,000 requests/second from our caching DNS servers to the outside world.

A bit more monitoring (encouraged by other sites reporting significantly lower rates than us) identified a couple of sources of unusual load:

1. The solution we use for filtering incoming mail sends DNS queries to all servers listed in resolv.conf in parallel. That doesn’t give any benefit in our environment so we changed things so that it only uses the caching DNS server on localhost.

2. We were seeing high rates of reverse lookups for IP addresses in ranges belonging to Google (and others) which were getting SERVFAIL responses. These are uncacheable so always result in queries to external servers. To test this theory I installed dummy empty reverse zones on the caching name servers and the queries immediately dried up. The fake empty zones meant that the local servers would return a cacheable NXDOMAIN rather than SERVFAIL.

One query that results in SERVFAIL was being requested half a dozen times a second through one of our DNS servers. It just caught my eye – there are probably many others generating a similar rate.
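As a rough illustration of why a cacheable NXDOMAIN makes such a difference, here is a toy model in Python of a cache sitting in front of an upstream server. It follows the description above (NXDOMAIN is remembered, SERVFAIL is not). The class, the TTL value and the placeholder reverse name are all made up for the sketch, and it says nothing about how BIND is actually implemented.

```python
import time


class ToyCache:
    """Toy resolver cache: remembers NXDOMAIN answers, never SERVFAIL."""

    def __init__(self, upstream, negative_ttl=300):
        self.upstream = upstream          # callable: name -> response code
        self.negative_ttl = negative_ttl  # seconds to remember NXDOMAIN (assumed value)
        self.cache = {}                   # name -> expiry time
        self.upstream_queries = 0

    def lookup(self, name, now=None):
        now = time.time() if now is None else now
        expiry = self.cache.get(name)
        if expiry is not None and expiry > now:
            return "NXDOMAIN"             # answered locally, nothing sent upstream
        self.upstream_queries += 1
        rcode = self.upstream(name)
        if rcode == "NXDOMAIN":
            self.cache[name] = now + self.negative_ttl
        return rcode                      # SERVFAIL is passed on but not stored


def count_upstream(rcode, repeats=100):
    """Ask for the same (placeholder) name repeatedly and count upstream queries."""
    cache = ToyCache(lambda name: rcode)
    for _ in range(repeats):
        cache.lookup("1.2.0.192.in-addr.arpa", now=0)
    return cache.upstream_queries


print("SERVFAIL:", count_upstream("SERVFAIL"))
print("NXDOMAIN:", count_upstream("NXDOMAIN"))
```

In this model every repeat of a SERVFAIL lookup goes upstream, while the NXDOMAIN case only goes upstream once.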

Asking colleagues at other institutions via the ucisa-ig list and on ServerFault reinforced the hypothesis that (a) the main DNS servers were doing the right thing and (b) this was a local config problem (because no-one else was seeing this).

We turned on request logging on the BIND DNS servers and used the usual grep/awk/sort pipeline to summarise – that showed that most requests were coming from the Windows domain controllers.
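For anyone who would rather not do this with grep/awk/sort, the same sort of summary can be sketched in a few lines of Python. This is not the pipeline we used. The layout of query-log lines varies between BIND versions, so the regular expression and the sample lines below are assumptions to be adjusted against a real log.

```python
import re
from collections import Counter

# Matches the client address in a BIND query-log line. The exact layout varies
# between BIND versions, so treat this pattern as an assumption to adjust.
CLIENT_RE = re.compile(r"client (?:@0x[0-9a-fA-F]+ )?([0-9a-fA-F.:]+)#\d+")


def top_clients(lines, n=10):
    """Return the n busiest client addresses as (address, count) pairs."""
    counts = Counter()
    for line in lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


# Made-up sample lines using documentation addresses.
sample = [
    "client 192.0.2.10#53211: query: example.org IN A +",
    "client 192.0.2.10#53212: query: example.net IN PTR +",
    "client 192.0.2.20#40001: query: example.org IN A +",
]
print(top_clients(sample))
```

To run it over a real log, passing an open file object to top_clients() should be enough.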

Armed with this information we looked at the config on the Windows servers again and the cause was obvious. It was a very long-standing misconfiguration of the DNS server on the domain controllers – they were set to forward not only to a pair of caching servers running BIND (as I thought) but also to all the other domain controllers which would in turn forward the request to the same set of servers. I’m surprised that this hadn’t been worse/shown up before since as long as the domain returns SERVFAIL the requests just keep circulating round.
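To get a feel for how quickly this sort of forwarding can multiply requests, here is a deliberately crude toy model in Python. It is not a description of how the Windows DNS server really sequences its forwarders. It simply assumes that each domain controller asks the caching servers once, gets SERVFAIL, and then passes the question to every other DC. The hop limit is an artificial cut-off so that the sum terminates, and the number of DCs is made up.

```python
def upstream_queries(num_dcs, hops_left):
    """Count queries reaching the caching servers for one failing lookup.

    Toy model: a domain controller asks the caching servers once (SERVFAIL),
    then forwards the same question to every other DC, which does the same.
    hops_left is an artificial cut-off so that the recursion terminates.
    """
    total = 1  # this DC's own query to the caching servers
    if hops_left > 0:
        total += (num_dcs - 1) * upstream_queries(num_dcs, hops_left - 1)
    return total


for hops in range(4):
    print(hops, upstream_queries(num_dcs=4, hops_left=hops))
```

Even with only a few rounds of forwarding allowed, one failing lookup turns into dozens of queries in this model.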

The graph below shows the rate of requests that gave a SERVFAIL response – note the sharp decrease in March when we made the change to the DNS config on the AD servers. [in a fit of tidiness I deleted the original image file and now don’t have the stats to recreate it – the replacement doesn’t cover the same period]


I can see why this might have seemed like a sensible configuration at the time – it looks (at one level) similar to the idea of a set of squid proxies asking their peers if they already have a resource cached. Queries that didn’t result in SERVFAIL were fine (so the obvious tests wouldn’t show any problems).

Postscript: I realised this morning that we’d almost certainly seen symptoms of this problem early last July – graph below shows the very sharp increase in requests followed by the sharp decrease when we installed some fake empty zones. This high level of requests was provoked by an unknown client on campus looking up random hosts in three domains which were all returning SERVFAIL. Sadly we didn’t identify the DC misconfiguration at the time.


Recent infrastructure issues (part 1)

It’s not been a great few months for IT infrastructure here. We’ve had a run of problems which have had significant impact on the services we deliver. The problems have all been in the foundation services which means that their effects have been wide-ranging.

This informal post is aimed at a technical audience. It’s written from an IT systems point of view because that’s my background. We’ve done lengthy internal reviews of all of these incidents from technical and incident-handling viewpoints and we’re working on (a) improving communications during major incidents and (b) making our IT infrastructure more robust.

Back in November and December we had a long-running problem with the performance and reliability of our main IT infrastructure. At a system/network level this appeared to be unreliable communications between servers and the network storage they use (the majority of our systems use iSCSI storage so are heavily reliant on a reliable network). We (and our suppliers) spent weeks looking for the cause and going down several blind alleys which seemed very logical at the time.

The problems started after one of the network switches at our second data centre failed over from one controller to the stand-by controller. There were no indications of any problems with the new controller so the theory was that something external had happened which caused the failover _and_ led to the performance problems. We kept considering the controller as a potential cause but discounted it since it reported as healthy.

After checking the obvious things (faulty network connections, “failing but not yet failed” disks) we sent a bundle of configs and stats to the vendor for them to investigate. They identified some issues with mismatched flow control on the network links. The theory was that this had been like this since installation but only had significant impact as the systems got busier. We updated the config on both sides of the link and that seemed to give some improvement but obviously didn’t fix the underlying problem. We went back to the vendor and continued investigations across all of the infrastructure but nothing showed up as a root cause.

Shortly before the Christmas break we failed over from the (apparently working) controller card in the main network switch at our second data centre to the original one – this didn’t seem logical as it wasn’t reporting any errors but we were running out of other options. However (to our surprise and delight) this brought an immediate improvement in reliability and we scheduled replacement of the (assumed) faulty part. We all gave a heavy sigh of relief (this was the week before the University closed for the Christmas break) and mentally kicked ourselves for not trying this earlier (despite the fact that the controller had been reporting itself as perfectly healthy throughout).

At the end of January similar issues reappeared. Having learnt our lesson from last time we failed over to the new controller very quickly – this didn’t have the hoped-for effect but we convinced ourselves that things were recovering. In hindsight the improvement was because it was late on Friday afternoon and the load was decreasing. On Saturday morning things were worse and the team reassembled to investigate. This time we identified one of a pair of network links which was reporting errors. The pair of links were bonded together to provide higher bandwidth and a degree of resilience. We disabled the faulty component leaving the link working but with half the usual throughput (but still able to handle normal usage) and this fixed things (we thought). Services were stable for the rest of the week but on Monday morning it was clear that there was still a problem. At this point we failed back to the original controller and things improved. Given that we were confident that the controller itself wasn’t faulty (it had been replaced at the start of the month) the implication was that there was a problem with the switch which is a much bigger problem (this is what one of these switches looks like). We’re now working with our suppliers to investigate and fix this with minimal impact on service to the University.

In the last few weeks we’ve had problems with the campus network being overloaded by fallout from an academic practical exercise, a denial of service attack on the main web server and thousands of repeated requests to external DNS servers causing the firewall to run out of resource – but they’re stories for another day.

Web proxy changes (reverted)

Unfortunately we’ve had to back out the change to the proxy config. We found that Windows XP clients didn’t handle the change properly and lost access to external web sites.

The good news is that the vast majority of clients worked fine so once we’ve developed a plan for handling the older machines we’ll try again (in the new year).

Disappearing messages to lists

We had a question last week about some messages sent to a local mailing list not reaching the members of the list. When we looked at the logs on the list server we saw that the messages were being discarded as duplicates/loops. This is an explanation of why this happens and how to avoid it.

Every mail message has an identifying label associated with it which should be globally unique. This label is called a message-id (commonly shortened to msgid). The system we use to run our mailing lists (Sympa) relies on this to stop looping messages being sent to a list repeatedly. In the version we use at the moment the list of msgids that have been seen is only cleared out when the server is restarted for maintenance – this happens irregularly (later versions expire entries in the cache after a fixed time). This is a reasonably common technique to protect lists from mail loops – I remember implementing it in the locally written MLM when I worked at Mailbase.

The system deliberately sidelines the message silently because it thinks this is a possible loop and sending a message to the sender has a fair chance of making things worse.
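The general technique is easy to sketch. The Python below is a minimal illustration of a msgid cache that expires entries after a fixed time. It is not Sympa’s code, and the class name and the one hour lifetime are assumptions made up for the example.

```python
import time


class SeenMessageIds:
    """Remember message-ids for a fixed time and flag repeats."""

    def __init__(self, max_age=3600):
        self.max_age = max_age  # seconds to remember a msgid (assumed value)
        self.seen = {}          # msgid -> time first seen

    def is_duplicate(self, msgid, now=None):
        now = time.time() if now is None else now
        # Expire old entries so the cache does not grow without limit.
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.max_age}
        if msgid in self.seen:
            return True         # looks like a loop: the caller would sideline it
        self.seen[msgid] = now
        return False


cache = SeenMessageIds(max_age=60)
print(cache.is_duplicate("<abc@example.org>", now=0))    # first sighting
print(cache.is_duplicate("<abc@example.org>", now=10))   # repeat within a minute
print(cache.is_duplicate("<abc@example.org>", now=100))  # old entry has expired
```

A cache that is only emptied on restart behaves like this with a very large max_age, which is why a genuinely new message carrying a reused msgid can be discarded long after the original was sent.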

Unfortunately some mail programs will create messages with identical msgids. I believe that some versions of Outlook do this if you use the “Resend” option on an existing message. The workaround is to not use “Resend” unless you’re resending a message that failed to deliver. Some old versions of the Pine mail program generated duplicates occasionally because they used the current time to create the msgid but missed out one of the components (hours, minutes or seconds – can’t remember which).

We’ve found another instance in which Outlook will send messages with identical msgids and that’s using templates. If you use Outlook templates in non-cached mode (more specifically if you use a template created when in non-cached mode) then messages created from that template will all have the same msgid. See discussion at

The suggested workaround for this is to change to using Outlook in Cached mode and then recreate the templates (you need to create new templates because the fault is attached to the template). If for some reason cached mode isn’t suitable all the time (for example if you regularly use different desktop machines) you just need to turn it on when creating the template.


Outlook, text formatting and signatures

Summary: appending three spaces to the end of each line of a text block (eg a signature block) in a plain text message will stop Outlook from joining lines and messing up your formatting.

Long version…

For a while now we’ve had niggling issues with formatting of plain text email signatures in Outlook.
Problem was that a signature sent as

Paul Haldane
Infrastructure Systems
Information Systems and Services, Newcastle University
Claremont Tower
Claremont Road
Newcastle upon Tyne

Would be displayed (by default) in Outlook as

Paul Haldane
Infrastructure Systems
Information Systems and Services, Newcastle University Claremont Tower Claremont Road Newcastle upon Tyne

I don’t understand why the last line isn’t joined on to the penultimate line but I assume that it’s another feature of Outlook’s rendering algorithm.

NB That’s not my real email sig – the one that I use is

Paul Haldane
Manager, Infrastructure Systems
Information Systems and Services
Newcastle University

The example I’ve used at the top has characteristics which lead to the problem appearing while my real sig doesn’t (which had been one of the puzzling factors during the investigation).

The correct rendering can be shown by the recipient selecting “restore line breaks” when looking at a message or un-ticking “Remove extra line breaks in plain text messages” (Options->Preferences->E-mail options). Even if we decided that changing the default setting for University managed machines to not remove extra line breaks was a good idea, we obviously can’t control the settings for external recipients.

One of the reasons that this issue was hard to track down was that not all sigs demonstrated the problem. Mine didn’t; our director’s did and our VC’s did (which is one of the things that gave the issue visibility).

Comparing the original versions of the three I guessed that the common factor might be line length. Both of the problem sigs had longer lines than mine – split was somewhere between 38 and 44 characters. More testing …


One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXX 4
One two three four five EOL


One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXXx
One two three four five EOL


One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXX
One two three four five EOL



One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXX 4 One two three four five EOL


One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXXx One two three four five EOL


One two three four five EOL
0000 xxxx 1111 XXXX 2222 xxxx 3333 XXXX
One two three four five EOL

So the breakpoint is 40: a line of 40 characters or more is joined to the line after it.

One unexplained fact was that the longest line in the VC’s sig was

Vice-Chancellor: Newcastle University

Which if you count is only 37 characters. However previous attempts to fix the problem by appending spaces to the end of the line (see below) meant that the line had two non-breaking spaces and a space at the end bringing the length to 40. (Non-breaking spaces can be explicitly inserted by typing control-shift-space in Outlook’s message editor but there might be some cleverness going on that converts three adjacent spaces to a mixture of non-breaking and real.)

Tests and investigation had got us a reasonable model for when the problem would happen (and an explanation for why I didn’t see it with my sig). We didn’t yet have a solution.

Internet folklore suggests that adding three spaces to the end of each line (or two spaces at the start; or a tab at the end – opinions vary as to which is the most consistent) will result in messages being rendered in Outlook as intended.
I tried appending three spaces to each non-empty line in the input. This gave the desired behaviour; lines were rendered correctly by the recipient’s instance of Outlook (no matter what their setting for removing extra line breaks was).
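If you want to apply the same treatment to a signature block automatically, something like the following Python would do it. It’s only a sketch and the function name is made up. It pads every non-empty line with three ordinary spaces and leaves blank lines alone.

```python
def pad_lines(text, padding="   "):
    """Append three spaces to every non-empty line of a plain text block."""
    return "\n".join(
        line + padding if line.strip() else line
        for line in text.split("\n")
    )


signature = "Paul Haldane\nInfrastructure Systems\n\nNewcastle University"
print(repr(pad_lines(signature)))
```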

I was just looking back through my open tabs to put in some references to the Internet folklore that I’ve mentioned and spotted a very informative post that I must have consistently skimmed over.
mtruesdell says the following …

Every message starts with continuation off.
Lines less than 40 characters long do not trigger continuation, but if continuation is on, they will have their line breaks removed.
Lines 40 characters or longer turn continuation on. It remains on until an event occurs to turn it off.
Lines that end with a period, question mark, exclamation point or colon turn continuation off. (Outlook assumes it’s the end of a sentence?)
Lines that turn continuation off will start with a line break, but will turn continuation back on if they are longer than 40 characters.
Lines that start or end with a tab turn continuation off.
Lines that start with 2 or more spaces turn continuation off.
Lines that end with 3 or more spaces turn continuation off.
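Those rules are concrete enough to turn into a small model. The Python below is one literal reading of the list above, not Outlook’s real algorithm (which isn’t published), and the function names are made up. The list doesn’t say what happens with blank lines, so the code assumes they are kept and switch continuation off.

```python
BREAKPOINT = 40  # lines this long (or longer) turn continuation on


def is_off_line(line):
    """True if the line matches one of the 'turn continuation off' rules."""
    return (
        line.endswith((".", "?", "!", ":"))
        or line.startswith("\t")
        or line.endswith("\t")
        or line.startswith("  ")
        or line.endswith("   ")
    )


def render(lines):
    """Return the lines as this model predicts they would be displayed."""
    shown = []
    continuation = False
    for line in lines:
        if line == "":
            # Assumption (not in the list above): blank lines are kept
            # and reset continuation.
            shown.append(line)
            continuation = False
            continue
        off = is_off_line(line)
        if continuation and not off:
            shown[-1] = shown[-1] + " " + line  # line break removed
        else:
            shown.append(line)                  # starts on a new line
        if off:
            continuation = False
        if len(line) >= BREAKPOINT:
            continuation = True                 # may turn straight back on
    return shown


sig = [
    "Paul Haldane",
    "Infrastructure Systems",
    "Information Systems and Services, Newcastle University",
    "Claremont Tower",
    "Claremont Road",
    "Newcastle upon Tyne",
]
for shown_line in render(sig):
    print(shown_line)
```

Run against the example signature from the top of this post, the model joins everything after the long line, and padding every line with three spaces leaves the block untouched, which matches what we saw in testing.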

This is from testing against Outlook 2007 – he’s obviously got more patience than me. It would be so much easier if Microsoft published the algorithm that Outlook uses – at the moment there’s nothing to say that this behaviour won’t change in future versions.

UCISA-IG Service Availability Event

Just back (well last week – taken me a while to write up my notes) from the UCISA Infrastructure Group (UCISA-IG) event in Liverpool – “Service availability – is 24x7x365 really necessary?”. These notes are very rough but I’d rather get them out now while reasonably fresh.

This sort of event is always worthwhile not just because of the “formal” talks but also the chance to meet colleagues from other institutions and talk about common issues. Doing this face to face allows you to be a bit less discreet than you would be on a mailing list :->. Topics that came up in passing were account management systems (why does everyone seem to write their own?); how IT services are organised internally (by platform/by layer/at random) and the difference between working in a large IT service (where most people are specialists and much of what your colleagues do is a black (or at least grey) box) and a small organisation where the IT person is likely to do network/storage/desktop/servers/everything else (because there’s no-one else).

Whilst the event was interesting and useful I felt the title was a bit misleading – most of it was talking about DR and BC (Business Continuity) rather than whether universities need 24×7 services. My instincts are
1. Not everything needs the same level of availability
2. If more services were designed to use asynchronous communication and message queues we wouldn’t have to have such a broad shutdown of services on the (hopefully rare) occasions that we need to shut down one of the fundamental systems. To construct a concrete example: if a member of the University needs to update their address does it matter if the database change happens instantaneously or is it OK if the change is made within half a day? The important thing is that they should be able to submit their change whenever is convenient (and that they get some feedback when it’s complete). Moving to reliable loose coupling should reduce our need for everything running all the time.
3. Some systems are intrinsically easy to make resilient. My favourite is mail relaying (not the complete mail service – just the pure relay). Because each transaction is independent and there’s a standard mechanism to distribute requests between servers (MX records) it’s easy – you just add more servers (though there was the problem with large MX sets and poorly configured remote systems – I think that hit us when we got to 10 entries in our MX list).
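The address change example in point 2 can be sketched in a few lines of Python. This is purely illustrative: the class and method names are invented, the queue lives in memory, and a real service would need a durable queue and proper feedback to the person once the change has been applied.

```python
from collections import deque


class ChangeQueue:
    """Accept change requests at any time; apply them when the backend is up."""

    def __init__(self):
        self.pending = deque()
        self.applied = []

    def submit(self, person, new_address):
        self.pending.append((person, new_address))
        return "accepted"            # immediate feedback to the submitter

    def drain(self, backend_up):
        """Apply queued changes in order; do nothing while the backend is down."""
        while backend_up and self.pending:
            self.applied.append(self.pending.popleft())
        return len(self.pending)     # how many changes are still waiting


queue = ChangeQueue()
print(queue.submit("alice", "1 New Street"))   # works even if the database is down
print(queue.drain(backend_up=False))           # still waiting
print(queue.drain(backend_up=True))            # applied once the database is back
```

The point is that the person submitting the change never needs to know (or care) that the database was unavailable at the time.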

Opening session was David Teed talking through the processes you would use to set your recovery targets for services. Not everything needs to be recovered in 4 hours – working through Business Impact Analysis and leading to an ICT Recovery Statement (what you’ll recover, how long will it take and what workarounds will the business use to cope in the mean time). This leads to a list of resource requirements and allows you to manage customer expectations and cost-justify solutions.
Idea is that you then invest – matching the requirements exposed by BIA (not going overboard on making things over resilient – though you may do more if it brings other benefits). All very sensible and if we haven’t done something like this already we should.

Next Adrian Ellison, LSE talking about working from home (WFH) as an item in the DR/BC toolbox.
Often a big part of the BC plan but there are (of course) issues. DR moved up LSE agenda after 7/7.

Alternative accommodation on the larger campus might be a better solution (as it maintains the face to face contact which is lost). As part of planning allocate suitable alternative for each critical activity (making sensible assumptions on loss (of access to) buildings).
Reciprocal arrangements with neighbouring institutions may be a possibility.
Not everyone can work from home (and some can’t do all of their jobs) – specialised equipment/other people.
WFH isn’t sustainable for long.

To support WFH you need
– Resilient dual-path network with OOB access via 3rd party ISP (tested regularly)
– Robust DC strategy with resilience
– Likely that you’ll need to scale up remote access systems quickly. For Citrix etc will probably need extra licences
– Think about how you do remote support (LSE use LogMeIn Rescue)
– Separate VPN/remote access for IT staff?

Telephony – mobile networks may (will) become overloaded
Will need to divert key numbers to alternate locations (pre-arrange with supplier)
May be able to divert to external numbers (advanced IPT – “remote office”)

Remote learning – if lots of students are accessing rich content do we have bandwidth to cope (to halls?)

Information security is important but if you make things too difficult people will create their own workarounds which will be worse in terms of security.
Make clear that there is personal responsibility for security of data/systems under their control.
Managing people – motivation – all more difficult when remote – need f2f meetings (off-site)
Off-site working relies on trust

Talk from Oracle/Strathclyde about how the availability features of 11g can help with resilience. The idea of automatic storage management (ASM) which (as I understood it) replicates data across multiple low cost modular storage arrays seems like a nice idea. Anything that helps us to move away from big, expensive boxes that sit in the middle of everything (and tend to be fussy eaters) is welcome.
Active data guard (ADG) – replication of data – can use replicated copy for read-only queries/BI etc as well as a backup to use when the primary site fails (so that you’re getting some use out of the standby kit).

Talk by Adrian Jane, University of Plymouth on how they use IPstor appliances to virtualise storage. These boxes sit between the real storage and the machines using the storage. This allows you to do mirroring, migration and similar without downtime and without changing the configuration on the clients. IPstor boxes are hardened Linux servers. They obviously need to be replicated (as all the storage traffic flows through them) and reasonably chunky (for the same reason). Plymouth are using something like HP 585 G6 quad cpu (6 core), 32G ram, 4x 8Gb HBAs.
As well as the obvious advantages, there’s also the benefit of simpler client config – all the mirroring is done in the IPstor.

Last talk was Richard Smith, Sheffield Hallam University about how they use VMware. They moved further with VMware than we have – over 200 VMs (though I guess if we count up all of our Xen and VMware guests and add on all the Solaris zones for SAP we’d get a similar number). Running higher numbers of guests per host than us (50 as a matter of course, up to 120). Vmotion allowed them to migrate services to new data centre with no downtime.
Vsphere can now use HP’s iLO technology to power up extra servers to cope with peak loads (and I think to reset hardware that appears to be hung).
Nice feature was the use of template VMs for Terminal Services servers – this let SHU scale up their TS capacity very quickly to cope with extra load when large numbers of people worked from home because of the bad weather at the start of the year.…ailability.aspx

Failure is inevitable (or is it?)

A very thought provoking paper on why complex systems fail
How Complex Systems Fail by Richard Cook

This was referenced by John Allspaw on his blog.

One of the most interesting points for me is number ten “All practitioner actions are gambles”. Whenever we do (or don’t do) something – upgrade a package, reboot a server, restart a service there’s a risk that it’s not going to end happily. We can (often) minimise the risk by trying the operation in a test environment first and mitigate the consequences of failure by having a backout plan (and backups :->) but sometimes our experience tells us that we should “just do it” and it will be fine. Most of the time it is but sometimes it isn’t. This leads into point seven “post accident attribution to a ‘root cause’ is fundamentally wrong”. Yes there will be a trigger whether it’s a dodgy disk controller or the (apparently) unrelated package updated last month but the _real_ problem is that the odds are against us.

When running production services we have to balance the costs of testing, scheduled downtime, redundancy against the probability of failure (and the cost of resulting unscheduled downtime). Because of these costs we may (and do) run systems with known issues. We obviously can’t do this if the problem has a direct impact on service but if it’s a failure which is masked by redundancy (say a single fibre path) and the costs and risks involved in investigating the failure and bringing the system back to perfection are judged to be too high then we might decide to leave well alone. This doesn’t make me happy but I’ve got a clearer framework to think about it now.

This ties in with “Better”, a book by Atul Gawande that I’ve just read. He’s talking about how surgeons in particular (he’s a surgeon) and doctors in general make decisions and how they try to improve. A big part of this is the need for measurement and reflection – (a) collect data and (b) think about it.

Asset registers

I’ve been musing recently (actually for ages but the issues have only recently crystallised) about asset registers and/or inventories.
I’m talking about servers here rather than desktops – I’m sure the issues overlap but there are differences both in terms of scale and variety.

We need to have an up-to-date asset register with information about where things are, how much they cost and who paid for them (both because the University says so and because it’s the right thing to do).

Since this information doesn’t obviously help us run services it gets viewed as an admin overhead and tends to be updated infrequently (usually just before the annual deadline).

My feeling is that the best way to get this info kept up-to-date is to use it for operational things – in my ideal world you would make config changes to a single database (which would apply policies to fill in defaults and highlight inappropriate settings) and the system would then generate the correct entries in all the other databases (from the config files we use for automated installs with jumpstart and kickstart to the entries in the Nagios and Munin monitoring systems).

We need to hold two types of information about systems. First the ‘financial’ data (cost, date of acquisition etc) and then the ‘system’ data (services provided, rack location, switch port, RAM installed, OS version etc).
Most (all) of the first set of data is fixed when the box arrives here and won’t change over time. Capturing this generally involves a human (gathering info from purchase order and physical items and sticking an asset tag on the box) and should be part of our goods inwards process.

Much of the second set of data will change over time and should be maintained automatically (OS version, RAM, network interfaces). Makes much more sense for the computers to keep this up-to-date. Stuff like which packages are installed and which services are running should be controlled by a configuration management system like cfengine. The CM system and the inventory need to be linked but I don’t think they’re the same thing.

There’s a set of data which (at least in our environment) changes less frequently. An example is location; most servers arrive, get racked and sit in the same rack location until they’re retired. We occasionally move existing servers either between server rooms (to balance up site resilience) or between racks within a room (perhaps if we’ve retired most things in a rack and want to clear it out completely). This process obviously involves a human and part of the process should be updating records to show the new location. I’m keen to back this up with a consistency check (to catch the times where we forget). It should be possible to use the MAC addresses seen on the network switches to find which servers are where (since there is a many to one mapping between network switches and rooms). Most of our server rooms have a set of racks in the middle with switches in, and servers are connected via structured cabling and patch panels, so this doesn’t help with moves within a room. However we’re gradually moving towards having switches in the server racks.
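The consistency check itself is simple once the data has been collected. Here is a minimal Python sketch which assumes we already have three lookups to hand: the room recorded for each server, the room each switch lives in, and the switch on which each MAC address was last seen. How those are gathered is left open, and all the names and sample values are made up.

```python
def location_mismatches(recorded_room, switch_room, mac_seen_on):
    """Find servers whose recorded room disagrees with where their MAC is seen.

    recorded_room: server MAC -> room held in the inventory
    switch_room:   switch name -> room (many switches to one room)
    mac_seen_on:   server MAC -> switch where the MAC was last seen
    """
    mismatches = {}
    for mac, room in recorded_room.items():
        switch = mac_seen_on.get(mac)
        if switch is None:
            continue                    # not seen on any switch; can't check
        actual = switch_room.get(switch)
        if actual is not None and actual != room:
            mismatches[mac] = (room, actual)
    return mismatches


recorded = {"00:00:5e:00:53:01": "room-a", "00:00:5e:00:53:02": "room-a"}
switches = {"sw1": "room-a", "sw2": "room-b"}
seen = {"00:00:5e:00:53:01": "sw1", "00:00:5e:00:53:02": "sw2"}
print(location_mismatches(recorded, switches, seen))
```

Anything reported is a candidate for a forgotten record update after a move between rooms.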

I’ve been looking for an open source system that will help us do this. Open source isn’t an absolute requirement, open interfaces are (because I want to use this information to drive other things). I know we could lash together a MySQL database and web frontend to hold details entered by hand. I’m sure we could extend this to import info from system config reports generated on the servers themselves and sent back to the central server. The thing that stops me doing this is the feeling that someone out there must already have done this.

I recently came across the slide deck from Jordan Schwartz’s 2006 presentation on Open Source Server Inventories

Which referenced the Data Center Markup Language site which has some interesting ideas about representing information about systems in a portable way. DCML seems to have gone quiet though.

Also referenced the Large Scale System Configuration group’s pages –
Lots of interesting thoughts about how large systems could/should be managed (but nothing I could spot to solve my immediate problem).

I installed a number of asset tracking systems. None (so far) have gone click with me. It’s quite possible that I’ve missed the point with some of them but here’s my quick take.

Asset Tracker for RT
We don’t use RT (we use Peregrine’s ServiceCenter) so integration with RT doesn’t win us anything. As far as I can see this relies on manually entered data (though I’m sure it would be possible to automate some population of the asset database).

OCS Inventory NG
I quite liked this one. Agents installed on the clients generate XML files which are sent back to the server. My main objection was the large set of prerequisites for the agents which made deployment painful. My ideal agent would be a very lightweight script with very few dependencies which used standard OS tools to gather info and then sent the info back to the server as simply as possible.
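To show what is meant by a lightweight agent, here is a minimal Python sketch that uses nothing beyond the standard library. It isn’t taken from any of the tools discussed here. The choice of fields is arbitrary, and sending the result back to a server is left out.

```python
import json
import platform
import socket
import uuid


def gather_inventory():
    """Collect a few basic facts about this machine using only the standard library."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        # uuid.getnode() returns a hardware (MAC) address as an integer,
        # or a random number if none can be found.
        "mac": format(uuid.getnode(), "012x"),
    }


if __name__ == "__main__":
    print(json.dumps(gather_inventory(), indent=2, sort_keys=True))
```

A real agent would want more than this (RAM, disks, all the network interfaces), but the shape would be the same: gather facts, serialise them simply, send them home.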

This one definitely looks interesting (but perhaps not for this immediate problem) and from a brief skim of the wiki it would be useful for getting a view of rack capacity for planning etc and dependencies. Some comments on the mailing list imply that its primary purpose isn’t an inventory system. No obvious way of doing bulk imports (but from a look at the database it wouldn’t be impossible).

Simpler version of Racktables? No obvious way of doing bulk imports.

This seems aimed more at auditing of desktops (makes a big play of the ability to get the serial number of the monitor which is undoubtedly useful if you’ve got hundreds of them, but all our servers are headless). I like the model of a simple text file generated on the client which is then imported by the server. Would need to produce Solaris version of agent.

In the longer term I expect that we’ll want to populate the asset database in Peregrine so that we can have better integration with ServiceCenter. I’m sure that’s the right thing to do but I suspect that the Peregrine asset database will end up being an automatic (subset) replica of the main database (because there’s some stuff that will be best kept separately).

Sysadmin book

I’ve just finished reading the new edition of “Practice of System and Network Administration”.
It’s just as good as I remember the first edition being. It’s not a technical book (in the sense of teaching you particular incantations) but instead talks about how you should approach running services right from less obvious areas like budgeting to how to work with building services professionals to get the right sort of machine room facilities. Also spends a lot of time on how to build successful teams and how to have happy admins _and_ happy customers.

The sysadmin team in this book’s context would cover everyone in ISS and it would be a useful book for everyone to read (they even have a chapter for non-technical managers :->) – more relevant to us than Clive Woodward’s book on Winning!. One of the themes running through the whole book is the need for cross-group involvement and communication (between admins and customers; different groups of admins; admins and suppliers; etc) so that both sides understand what the real requirements and constraints are.…d/dp/0321492668