
April 2008

April 26, 2008

Email Delays Resolved

Mail on the Modwest shared hosting system is flowing again now. Nothing was lost, but throughout the day yesterday (Friday), incoming messages could have been delayed several hours, webmail didn't work for most customers, and about half the mailboxes we host were totally unavailable for several hours last night.  A blow-by-blow is available on our offsite status page.

At the high-water mark for yesterday's problems, close to 200,000 messages were hung up in the queue, awaiting delivery to customer mailboxes. That included messages coming in to our support team, so it wasn't until late last night that a bazillion support request emails came in. Needless to say, we're a little behind.

The problem, as best we could tell, was some sort of deadlock in our Cyrus IMAP software. The result was that all the 'stuff the message in a mailbox' processes believed they needed to wait for access before doing so. All of them. I can sort of imagine them all politely saying "no please, I insist, you first" to each other, with no deliveries happening at all.
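For the curious, here is a toy illustration of that "you first" standoff. To be clear, this is not Cyrus code, and we don't know that the real deadlock looked like this; the worker names, the two locks, and the timeout are all made up for the sketch. Two delivery workers each grab one lock and then wait for the one the other is holding.

```python
import threading


def worker(name, first, second, delivered, barrier=None, timeout=0.2):
    """Take `first`, then try for `second`; deliver only if both are held."""
    with first:
        if barrier is not None:
            barrier.wait()          # wait until both workers hold their first lock
        if second.acquire(timeout=timeout):
            delivered.append(name)  # stands in for "stuff the message in a mailbox"
            second.release()
        if barrier is not None:
            barrier.wait()          # keep holding `first` until both attempts are over


def simulate(opposite_order):
    """Run two hypothetical delivery workers; return the names that delivered."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    delivered = []
    if opposite_order:
        # Each worker grabs a different lock first: the polite standoff.
        barrier = threading.Barrier(2)
        plans = [("worker-1", lock_a, lock_b), ("worker-2", lock_b, lock_a)]
    else:
        # Everyone takes the locks in the same order, so someone always wins.
        barrier = None
        plans = [("worker-1", lock_a, lock_b), ("worker-2", lock_a, lock_b)]
    threads = [
        threading.Thread(target=worker, args=(name, first, second, delivered, barrier))
        for name, first, second in plans
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(delivered)


print(simulate(opposite_order=True))   # [] : nobody delivers anything
print(simulate(opposite_order=False))  # ['worker-1', 'worker-2']
```

The timeout is only there so the demo finishes. Without it, the first case would simply hang forever, which is roughly what a stuck mail queue looks like from the outside.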

Anyway, the situation is resolved now; not necessarily in a permanent way, but everything is working at the moment and we're monitoring servers closely. Sorry about the trouble.

-JM

April 25, 2008

E-Mail Delays

We are currently working on an issue that has caused email delivery delays today. For updates, see the Modwest System Monitor at http://status.modwest.com/.

April 20, 2008

On the Tree of (Hardware) Woe

     "Contemplate this on the Tree of Woe..."

This is a line from Conan the Barbarian uttered by the villain Thulsa Doom after capturing Conan and beating him to within an inch of his life. Looking down upon the fallen warrior, Doom (played by Vader-esque James Earl Jones) then turns to his henchman Rexor and issues a directive:

     "Crucify him."

In response, Conan collapses in exhausted anguish.

Last week wasn't quite that bad for our hardware team, but it was a rough one. We had two important managed server customers suffer catastrophic hardware issues which required hours -- even days -- of downtime to fully repair. While dissimilar, both problems were storage-related.

First, a mini-primer on storage redundancy and fault tolerance:

There are two basic ways to provide storage redundancy in a standalone dedicated server: software and hardware.

  • The benefit of the software strategy is lower hardware cost, which we can then pass along to the customer.
  • The hardware solution is better at detecting and handling drive failures but costs more to deploy.

Neither strategy is immune to fault, as we've been painfully reminded over the past 10 days. Both affected customers themselves host dozens of their own customers on these servers, and so the interruptions were particularly undesirable for them.

In the first case, the server featured hardware-based redundancy: a RAID controller made by 3Ware. It just so happens that the particular driver for this controller has a rare bug when installed on servers running a certain Linux kernel version. The bug can cause generalized data corruption (!), and in this case, we discovered that various system configuration files in /etc were getting periodically scrambled, removed, and relocated!

This is a live server providing business-critical functionality to our customer and his customers, and yet the only way to ensure data integrity was to re-install the operating system with an updated driver. This required an overnight re-install and data-restoration procedure. When the server was back up and running, a few configurations and software versions were different, and thus we had to work through dozens of small web application and email glitches before everything was ship-shape.

That would have been challenging enough, but around the same time, another managed server suffered a hard drive failure. This machine used the software approach to storage redundancy, and while the drive failure was indeed detected, what wasn't detected was that the other hard drive was on its last legs and could also have failed at any moment. The server was in an absolutely precarious state when it finally alerted us to the issue.
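As a rough idea of what "detected" means in the software case, here is a minimal sketch that looks for a mirror missing a member. It assumes the Linux md driver and the text format of /proc/mdstat; the post doesn't say which software RAID this server ran, and the sample text and function name are ours, for illustration only.

```python
import re

# Matches the "[2/1] [U_]" part of a status line:
# expected member count / active member count, then one U or _ per member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: (active, expected)} for arrays missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+) :", line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (active, expected)
            current = None
    return degraded


# Made-up sample in the /proc/mdstat layout: md0 has lost sdb1, md1 is healthy.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      104320 blocks [2/1] [U_]

md1 : active raid1 sdb2[1] sda2[0]
      78043648 blocks [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # {'md0': (1, 2)}
```

Note what a check like this can't see: it reports that a member dropped out, not how healthy the surviving drive is. That takes a separate look at the drive itself (its SMART data, for example), which is exactly the gap described above.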

Ordinarily, when one hard drive in a mirrored pair fails, the procedure is to shut down, replace the failed drive, reboot, and instruct the system to re-mirror everything. In this case though, the server's remaining drive was in such bad shape that we suspected it wouldn't make it through the reboot, and that all current data on the machine would be lost.
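For reference, the ordinary procedure might be written down like this. It is a sketch that assumes Linux md managed with mdadm, which the post doesn't confirm; the device names are placeholders and the function only builds the list of steps, it runs nothing.

```python
def remirror_plan(array, failed_member, new_member):
    """Return (description, argv or None) pairs; None marks a manual step."""
    return [
        ("mark the bad member as failed",
         ["mdadm", "--manage", array, "--fail", failed_member]),
        ("drop it from the array",
         ["mdadm", "--manage", array, "--remove", failed_member]),
        ("shut down, swap the physical drive, boot, partition the new drive",
         None),
        ("add the new member so the mirror is rebuilt onto it",
         ["mdadm", "--manage", array, "--add", new_member]),
    ]


# Placeholder device names, not taken from the affected server.
for description, argv in remirror_plan("/dev/md0", "/dev/sdb1", "/dev/sdb1"):
    print(description, "->", " ".join(argv) if argv else "(manual step)")
```

Every step after the swap leans on the surviving drive being readable from end to end, since the rebuild copies all of it. That is the assumption that didn't hold here.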

The challenge was therefore to ensure that we had a snapshot of the most current data before beginning the surgery. But try making a fresh backup of 30+GB of data off a damaged hard drive that could fail at any moment; it's a slow, slow process, and took close to 18 hours to complete. It was only upon that completion that we could proceed with a full reinstall, reconfiguration, and restoration from backup (which took much of the next day).
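To put "slow, slow" in numbers, here is a back-of-the-envelope calculation. It treats "30+GB" as 30 GB and "close to 18 hours" as 18 hours, so it is a ballpark only, and the 50 MB/s figure for a healthy drive is our assumption for comparison, not a measurement.

```python
def average_throughput_mb_per_s(gigabytes, hours):
    """Average copy rate in MB/s, counting 1 GB as 1024 MB."""
    return gigabytes * 1024 / (hours * 3600)


rate = average_throughput_mb_per_s(30, 18)
print(f"failing drive: about {rate:.2f} MB/s")  # about 0.47 MB/s

# Assumed rate for a healthy drive, for comparison only.
healthy_mb_per_s = 50
minutes = 30 * 1024 / healthy_mb_per_s / 60
print(f"at {healthy_mb_per_s} MB/s the same copy would take roughly {minutes:.0f} minutes")
```

In other words, the damaged drive was handing over data at something like a hundredth of the assumed healthy rate.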

I'm happy to report that as of Friday afternoon, thanks to long hours of work by our best hardware guys, both servers are (to the best of our knowledge) repaired and fully functioning. 

Our managed servers as a rule boast the highest availability of all our services, with many of them enjoying near-100% historical uptime. But the fact remains that hardware components, and especially moving parts such as hard drives, simply wear out and break. We do what we can to ensure that when they do, repair and recovery are relatively painless, but this past week presented a 'perfect storm' of hardware problems.

By the way, Conan's friend Subotai rescued him from the Tree of Woe, and Conan returned heroically to vanquish the enemy.

-JM



