Last week, we had the misfortune of a major problem with our Exchange 2007 service. The server process that runs the mailboxes (store.exe) was crashing roughly every 5 minutes. Because this process is fundamental to the service, it was important to diagnose the issue as quickly as possible.
The most difficult part of the diagnosis was that the error was so generic that it gave no indication of where to look. Exchange is a complicated and awkward piece of software at the best of times, and the problem was compounded by unsatisfactory logging of what was happening.
We were able to determine that the problem happened on both nodes of the cluster, which made a fault on either individual server unlikely. It looked more likely that the problem was related to a database, or to something more granular within one.
While dismounting databases to narrow down the issue, we noticed that the problem went away when one particular database was dismounted. Sadly, this meant a significant amount of downtime for the mailboxes on the affected database, particularly as we needed to obtain diagnostic information for Microsoft to investigate, since the fault isn’t documented anywhere.
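The elimination approach can be sketched roughly as follows. This is an illustration in Python, not a script we ran: `still_crashes` is a hypothetical callback standing in for "mount exactly these databases and watch whether store.exe keeps crashing", and the database names are made up.

```python
def find_faulty_database(databases, still_crashes):
    """Dismount one database at a time and return the one whose
    removal stops the crashes, or None if no single database does."""
    for candidate in databases:
        # Everything except the candidate stays mounted.
        mounted = [db for db in databases if db != candidate]
        if not still_crashes(mounted):
            return candidate
    return None


# Simulated environment (assumption): the hypothetical "DB3" triggers the crash.
def simulated_check(mounted):
    return "DB3" in mounted


databases = ["DB1", "DB2", "DB3", "DB4"]
culprit = find_faulty_database(databases, simulated_check)
print(culprit)
```

In practice each check is expensive, because the mailboxes on the dismounted database are unavailable while you wait to see whether the crash recurs.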
We then narrowed the problem down further: the mailbox server crashed whenever a particular message, queued on one of our Hub Transport servers, was being delivered. We deleted the item from the queue and everything started to work again, but it wasn’t long before the problem recurred. Using the wonderful tool that is PowerShell, we could then establish that the second problem message was due to be delivered to the same mailbox as the first.
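The correlation itself amounts to intersecting the recipient lists of the offending messages. We did this in PowerShell against the queues; the sketch below shows the same idea in Python and is not the commands we used. The record layout (`message_id`, `recipients`) and the addresses are hypothetical stand-ins for exported queue data.

```python
def common_recipients(messages):
    """Return the recipients that appear on every message in `messages`."""
    # Lower-case addresses so that differences in case do not hide a match.
    recipient_sets = [
        {address.lower() for address in message["recipients"]}
        for message in messages
    ]
    if not recipient_sets:
        return set()
    return set.intersection(*recipient_sets)


# Hypothetical records for the two messages that triggered the crash.
crash_messages = [
    {"message_id": "msg-1", "recipients": ["alice@example.com", "Bob@example.com"]},
    {"message_id": "msg-2", "recipients": ["bob@example.com", "carol@example.com"]},
]

suspects = common_recipients(crash_messages)
print(sorted(suspects))  # prints ['bob@example.com']
```

With only two messages the overlap is easy to spot by eye, but the same intersection keeps working as more crash-triggering messages turn up.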
This indicated that the problem was specific to one mailbox. We moved that mailbox off our production server to restore stability for the thousands of other users whose mailboxes reside there. Once it had been moved, the other mailboxes were fine, which confirmed that we had narrowed the problem down to that one mailbox.
We could reproduce the crash by replaying the problem message into the test system, so we were now in a position to work out what it was about these two messages that caused the problem.
The problem seemed to be caused by a fault in the user’s rules. Fortunately, we had found the needle in the haystack, and at the same time we hope we have given Microsoft enough diagnostic information to investigate thoroughly why a problem with one user’s rules was enough to crash the entire server. That really is a big failing of Exchange.
One thing we noted was that the user with the problem mailbox was using Outlook 2003 exclusively. Had the user moved to Outlook 2007, the problem would probably have been alleviated to some extent, because Outlook 2007 alters the rules format.
It was an incredibly stressful couple of days, and it underlined the fact that email is a business-critical system. We are still waiting to hear back from Microsoft, but should the problem recur, we should be able to diagnose the fault much more quickly.