Saturday Night Maintenance
No, it's not a new John Travolta movie.
We're performing hardware maintenance on our shared hosting system on Saturday night, July 26th, 2008, starting at about 10PM Mountain Time. Hosted websites may be unavailable for up to five hours. Email (POP3/IMAP/SMTP) will keep working, but webmail won't.
We haven't had to schedule a five-hour maintenance window in a few years, so we'll also be sending out email to affected customers and their technical contacts.
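For anyone reading this from outside Mountain Time, here is a small sketch that converts the window to UTC. It assumes Mountain Daylight Time (UTC-6) is in effect on that date and treats the "about 10PM" start and the five-hour upper bound as exact, so take the result as an estimate rather than a guarantee. The names `MDT`, `window_start`, `window_end`, and `to_utc` are just illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Mountain Daylight Time (UTC-6) applies in late July.
MDT = timezone(timedelta(hours=-6), "MDT")

# Announced start time and the "up to five hours" upper bound.
window_start = datetime(2008, 7, 26, 22, 0, tzinfo=MDT)
window_end = window_start + timedelta(hours=5)


def to_utc(moment):
    """Return the same instant expressed in UTC."""
    return moment.astimezone(timezone.utc)


print("Start:", window_start.isoformat(), "=", to_utc(window_start).isoformat())
print("End:  ", window_end.isoformat(), "=", to_utc(window_end).isoformat())
```

Under those assumptions the window runs from 04:00 to 09:00 UTC on Sunday, July 27th.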
While our Modwest System Status page gives a brief overview of why this maintenance is necessary, I thought I'd take a moment here to provide some gory technical details.
As previously mentioned, we're in a transition phase, in which we are gradually replacing the storage hardware and software of our shared hosting system. This can mostly be done transparently, but we're unavoidably relying on vendors' hardware and software working properly.
And there's the rub. We need to replace an Areca RAID controller that isn't behaving, and because the new controller comes from a different vendor (LSI), that requires an operating system reinstall; hence, a five-hour maintenance window.
I asked our Operations Manager to explain the technicalities of the problem this maintenance window addresses, and here's how he described it:
While it's not clear exactly who is causing what, what is clear is that the Areca driver tries to de-reference a NULL pointer, and this is either because the adapter screws up, or the driver screws up somewhere. The result is a Solaris kernel fault, pointing at the arcmsr driver, and apparently an adapter lockup. It's not 100% clear what causes this condition. It could be the driver not handling some buffer appropriately, or it could be the card sending an error that the driver doesn't handle. It's pretty likely, though, that the issue is completely inside the arcmsr driver and Areca hardware.
One thing we did discover that means we HAVE to replace the Areca hardware is that in JBOD mode (which is how we use it, since we're using Solaris' ZFS superset of RAID functionality), any disk failure seizes the whole card up until the failure clears, or maybe until some apparently long timer expires. SATA and SAS have ethernet-like link failure detection. You know within milliseconds when the cable is pulled. The Arecas in JBOD mode seem unable to handle hot-swap of any type, or even failures of any type. When we tried to get Areca to address it, all we received was vague "you must have a failing drive" answers, which for a RAID card is a bad answer. Even in JBOD mode the controller should signal/propagate an error. Solaris would handle this condition.
Then there's the boot selection. All logical drives appear in the boot selection. Either the list fills up or the Arecas only show drives on the first controller. That's a problem if you want to be able to boot an alternate drive on a second controller.
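To make the point about propagating errors a bit more concrete, here is a toy sketch of our own. It is not Solaris, ZFS, or Areca code, and every class name in it (`Disk`, `Mirror`, `DiskError`) is hypothetical. It only illustrates the general idea: if a failed disk reports its error promptly, the redundancy layer above can mark it faulted and keep serving data from a healthy copy.

```python
class DiskError(Exception):
    """Raised when a disk reports a failure to the layer above it."""


class Disk:
    """Hypothetical disk that either returns data or reports an error."""

    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy

    def read(self, block):
        if not self.healthy:
            # The behavior asked for above: signal the failure right away
            # instead of stalling every other disk on the controller.
            raise DiskError(f"{self.name}: read failed")
        return self.data[block]


class Mirror:
    """Toy mirror: reads from the first disk that answers without error."""

    def __init__(self, disks):
        self.disks = list(disks)
        self.faulted = []

    def read(self, block):
        for disk in self.disks:
            if disk.name in self.faulted:
                continue  # skip disks already known to be bad
            try:
                return disk.read(block)
            except DiskError:
                # The error was propagated, so we can fail over.
                self.faulted.append(disk.name)
        raise DiskError("no healthy disks left")


blocks = {0: b"hello"}
mirror = Mirror([Disk("disk0", blocks, healthy=False), Disk("disk1", blocks)])
result = mirror.read(0)
print(result, mirror.faulted)
```

In this sketch the read still succeeds from `disk1` and `disk0` ends up on the faulted list. A controller that locks up instead of reporting the error never gives the upper layer that chance.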
Catch all that?
The point is that we absolutely need our core storage technologies to be 100% rock solid. Now that we've detected this problem, we must address it ASAP, and that's what this Saturday Night Maintenance is all about.
-JM