Earlier this week we experienced a storage server problem on the
Modwest shared hosting system. I just wanted to give you the low-down on everything that happened and how we responded.
Late Monday night, the server warned us about a possible hard drive
issue. The server holds a whopping 24 drives, so one drive having a
problem is normally no big deal. The next day, we prepared to
replace the drive, which wasn't marked as "bad", just "unavailable" for
some reason.
To make a long story short, we hit a glitch in the interaction between Sun's Solaris operating system and the LSI RAID hardware. The server crashed
and wouldn't restart, and the 500GB of customer data on it was
inaccessible.
Because something similar happened last year too, we immediately began
preparing an alternate server to take this one's place.
The new one runs good ol' Debian Linux.
But we still needed access to the current website data. We opened a
ticket with Sun commercial support. They couldn't diagnose it
beyond "hardware problem", but agreed that reinstalling the operating
system might work. Some 12 hours later, after numerous attempts, our
server team succeeded in accessing the customer data and began
transferring it off the problem server onto the server we had standing by
and ready to take its place.
It takes a long time to copy millions of files totaling nearly 500GB of data between servers, even on our fast internal network,
so a waiting game began.
We know our customers' sites are important to them, and we are analyzing every aspect of the event so we can improve our
operation and prevent a recurrence.
During the outage, our awesome Support Team had the chance to speak with
countless customers about the issue. We did our best to provide timely,
accurate information about the work in progress, and updated the public
Status Page frequently. Did we do OK?
The Ops Team put in a tremendous effort over two days to get things up and running
again as safely and quickly as possible. If you have a congratulatory
haiku for the guys, please post it here:
https://feedback.modwest.com/topic/103/Soothing_haikus_for_Modwest_engineers
We've recently hired a storage specialist who is now part of the team responsible for monitoring, maintaining, and improving our entire infrastructure. I'm much more confident now that the sorts of problems we endured with you this week will be less likely in the future. Thanks for your patience and understanding, and feel free to send us a soothing haiku!
-JM