Operations

May 08, 2008

Scotty, I need more power!

We haven't been shy about acknowledging our shared hosting performance issues this past year. It all goes back to our core storage architecture, a SAN-attached NFS server. The short version is that there's too much reading and writing happening too fast for this architecture to provide exceptional performance 24/7/365. Backups, which need to inspect every file on the storage system for changes, are taking more than 12 hours to complete now, degrading performance along the way.

While there can be many reasons for perceived slowness, we know our storage system performance is less than perfect from time to time, and we have seen the solution in our mind's eye (and on our engineering whiteboards): clustered storage. We'll be able to add machines, storage, and read/write capacity simultaneously, in a single, seamless, migration-free step. We've been working on it for about a year now. We have a ways to go still, and I'll post an update in June on the progress.

In the meantime, we have known that we really need to do something about this -- and today we have. We've deployed additional storage hardware -- not as a permanent fix, and not even as a step towards the new system, but as an 'overflow' system to which we can move a handful of busy accounts.

One of the first we've moved is the webmail system inside OnSite.  If you use it regularly, I think you'll find it considerably faster today than it was a few days ago. In fact, all of OnSite should feel faster now due to software upgrades we deployed last week.

We are still ironing out some quirks with this 'overflow' hardware, but next week we should be able to migrate a small number of additional sites to this system. If you've previously contacted us about performance issues with complex, template-driven sites, let us know by emailing feedback at modwest dot com that you're interested in having your files moved to the new system.  We'll handle everything, and you won't have to do anything different,  but the move process can cause a site to be unavailable for up to an hour or so, depending on the amount of data and number of files.

This isn't a permanent fix, but it should help mitigate the less-than-stellar filesystem access times some of you have occasionally experienced of late. And again, the longer-term fix will solve the NFS performance problem once and for all.

So, short term, we sent our request for more power to the engine room, and we've managed to eke out just a bit more. If only we had a dilithium crystal... more news soon.

-JM

April 20, 2008

On the Tree of (Hardware) Woe

     "Contemplate this on the Tree of Woe..."

This is a line from Conan the Barbarian uttered by the villain Thulsa Doom after capturing Conan and beating him to within an inch of his life. Looking down upon the fallen warrior, Doom (played by Vader-esque James Earl Jones) then turns to his henchman Rexor and issues a directive:

     "Crucify him."

In response, Conan collapses in exhausted anguish.

Last week wasn't quite that bad for our hardware team, but it was a rough one. We had two important managed server customers suffer catastrophic hardware issues which required hours -- even days -- of downtime to fully repair. While dissimilar, both problems were storage-related.

First, a mini-primer on storage redundancy and fault tolerance:

There are two basic ways to provide storage redundancy in a standalone dedicated server: software and hardware.

  • The benefit of the software strategy is lower hardware cost, which we can then pass along to the customer.
  • The hardware solution is better at detecting and handling drive failures but costs more to deploy.

Neither strategy is immune to fault, as we've been painfully reminded over the past 10 days. Both affected customers themselves host dozens of their own customers on these servers, and so the interruptions were particularly undesirable for them.

In the first case, the server featured hardware-based redundancy: a RAID controller made by 3Ware. It just so happens that the particular driver for this controller has a rare bug when installed on servers running a certain Linux kernel version. The bug can cause generalized data corruption (!), and in this case, we discovered that various system configuration files in /etc were getting periodically scrambled, removed, and relocated!

This is a live server providing business-critical functionality to our customer and his customers, and yet the only fix that would ensure data integrity was to re-install the operating system with an updated driver. This required an overnight re-install and data-restoration procedure, and once the server was back up and running, a few configurations and software versions were different, so we had to work through dozens of small web application and email glitches before everything was ship-shape.

That would have been challenging enough, but around the same time, another managed server suffered a hard drive failure. This machine used the software approach to storage redundancy, and while the drive failure was indeed detected, what wasn't detected was that the other hard drive was on its last legs and could also have failed at any moment. The server was in an absolutely precarious state when it finally alerted us to the issue.
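For the technically curious, here is a minimal sketch of how a failed mirror member can be spotted on Linux software RAID, assuming the machine uses the kernel's md driver and exposes its state in /proc/mdstat (the post above doesn't say which software RAID was in use, so treat that as an assumption). The function name and the sample text are made up for illustration.

```python
import re

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/2] [UU]" when healthy, "[2/1] [U_]" when degraded.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

# Hypothetical sample in the style of /proc/mdstat, for illustration only.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      78043648 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # ['md1']
```

Note what this does not catch: a drive that is still a member of the mirror but slowly dying, which is exactly the situation described above. Spotting that takes drive-level health data (SMART, for example), not just the array status.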

Ordinarily, when one hard drive in a mirrored pair fails, the procedure is to shut down, replace the failed drive, reboot, and instruct the system to re-mirror everything. In this case though, the server's remaining drive was in such bad shape that we suspected it wouldn't make it through the reboot, and that all current data on the machine would be lost.

The challenge was therefore to ensure that we had a snapshot of the most current data before beginning the surgery. But try making a fresh backup of 30+GB of data off a damaged hard drive that could fail at any moment; it's a slow, slow process, and took close to 18 hours to complete. It was only upon that completion that we could proceed with a full reinstall, reconfiguration, and restoration from backup (which took much of the next day).

I'm happy to report that as of Friday afternoon, thanks to long hours of work by our best hardware guys, both servers are (to the best of our knowledge) repaired and fully functioning. 

Our managed servers as a rule boast the highest availability of all our services, with many of them enjoying near-100% historical uptime. But the fact remains that hardware components, and especially moving parts such as hard drives, simply wear out and break. We do what we can to ensure that when they do, repair and recovery is relatively painless, but this past week presented a 'perfect storm' of hardware problems. 

By the way, Conan's friend Subotai rescued him from the Tree of Woe, and Conan returned heroically to vanquish the enemy.

-JM



February 06, 2008

Filesystem Performance: Not out of the woods, yet

As we've previously explained (here and here and here), Modwest has been challenged periodically to coax more performance out of the centralized storage architecture of our shared web hosting system. When system loads start to reach the point at which performance suffers, we address the issue by adjusting NFS and backup configurations, identifying resource-intensive sites and figuring out how to make them less so, and publishing tips for site owners about how to make their sites less storage-dependent.

Well, the issue has returned:

[Graph: Load]
(The week-long gap at the end of January was the result of an operating system upgrade on the monitoring server which sort of deconfigured the graphing subsystem. Oops!)

As you can see from the graph, we're approaching 52-week highs again. Monday of this week was especially rough in the middle of the day.

We've pretty much run out of 'big stuff that needs fixing' on the current architecture, so now we make small changes, each of which could provide some incremental gain in performance via reduced utilization of the central storage system. In addition to finding a few super-frequent cron jobs (which rarely need super-frequency), one of the actions we've taken today is FTP rate-limiting. That means that if you're on a mega-fast uplink you might not see your full potential in FTP upload speed.
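For the technically curious: one common way to cap transfer speed is a token bucket. The sketch below illustrates that general technique only; it is not a description of how our FTP servers are actually configured, and the class name, rate, and burst size are made up.

```python
class TokenBucket:
    """Allow short bursts, but hold the long-run rate to rate_bytes_per_sec."""

    def __init__(self, rate_bytes_per_sec, burst_bytes, now=0.0):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes   # start with a full bucket
        self.last = now

    def wait_needed(self, nbytes, now):
        """Charge nbytes at time `now`; return seconds the sender should pause."""
        # Refill for the time elapsed since the last call, up to the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend the tokens; a negative balance is a debt to be waited off.
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


# Hypothetical numbers: 1000 bytes/sec sustained, 2000-byte burst.
bucket = TokenBucket(rate_bytes_per_sec=1000, burst_bytes=2000)
print(bucket.wait_needed(2000, now=0.0))  # 0.0 (the burst is free)
print(bucket.wait_needed(1000, now=0.0))  # 1.0 (bucket is empty, so wait)
```

The time is passed in explicitly so the behavior is easy to reason about; a real server would read a clock and sleep for the returned interval.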

Of course what really needs to happen is a re-thinking of our centralized storage architecture. I'm happy to say we started that thought process more than a year ago. We already run a load balanced cluster of web servers, and a load balanced cluster of mail servers. Why not storage?

It's a hard problem, but a solvable problem to which we've been devoting a lot of engineering time over the past year. We're within three months or so of offering access to a "re-imagined" storage architecture that will not only address the issues we're currently experiencing, but will also open the door to some interesting features you won't find elsewhere.

-JM

P.S. I promise I'll publish more details about the new system soon, but as always, let us know of any questions by commenting below or otherwise contacting us.

January 16, 2008

Database Server Maintenance - Jan 17

As announced in OnSite and our public status page, we'll be taking our MySQL 5 server (db1.modwest.com) offline for up to thirty minutes tomorrow (Thursday) night around 7PM Mountain Time for system upgrades. If your site utilizes this server and you're a programmer (or have one nearby), you may want to consider making sure your web applications handle the maintenance window gracefully. (I'll post a technical comment below for one way to do so.)
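As a rough illustration of what "gracefully" could mean, here is a minimal sketch in Python: if the database connection fails, answer with a polite 503 page instead of a raw error. The names `render_page`, `connect`, and `build_page` are hypothetical stand-ins for your own page logic and your database driver's connect call. Real drivers raise their own exception types, so the `db_errors` default is an assumption you would adjust.

```python
MAINTENANCE_SECONDS = 30 * 60  # "up to thirty minutes", per the announcement

def render_page(connect, build_page, db_errors=(OSError,)):
    """Return (status, headers, body), degrading politely if the DB is down."""
    try:
        conn = connect()
    except db_errors:
        # Tell browsers and crawlers this is temporary and when to retry.
        headers = {"Retry-After": str(MAINTENANCE_SECONDS)}
        body = ("Our database is down for scheduled maintenance. "
                "Please try again shortly.")
        return 503, headers, body
    try:
        return 200, {}, build_page(conn)
    finally:
        conn.close()
```

The same idea carries over to any language: catch the connection failure in one place, and show visitors a friendly message rather than a stack trace or a blank page.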

Speaking of MySQL, the company that produces this excellent open-source database software was just  acquired by Sun Microsystems, an enterprise hardware and software company. Only time will tell, but we anticipate this will result in some powerful improvements to the already-great MySQL software.

-JM

November 21, 2007

Quick Performance Update

We've previously lamented and explained some shared hosting performance issues, and I'm happy to say that various countermeasures we've implemented over the past 60 days have helped a lot. The combination of hardware upgrades, software configuration changes, and the cooperation of dozens of understanding customers has improved performance a great deal. The graph below is indicative of the storage system's 'load average' we've experienced over the past 12 months, and as you'll notice, things are much happier now.

[Graph: Loadgraph2]

Also, we're in the midst of a storage architecture "re-engineering" that will make a more dramatic impact on our storage scalability, as well as opening the possibility of some interesting new features. Stay tuned for more updates.

-JM

October 25, 2007

Shared Hosting Performance Update

Back in May, we blogged about the performance issues our shared hosting storage system was facing, and how we intended to fix the problem.

In an effort to get some short-term performance boosts, we identified a few customers who were running resource-intensive scripts that were exacerbating the file system issues. We helped our customers reschedule these scripts and saw a marked improvement in performance.

In the last few weeks, system loads have started scaling up again, as you can see from the graph below.

[Graph: Loadgraph]

We scrambled to identify the problem again, and discovered multiple possible causes, including:

  • Sites with over 250,000 files stored (!), which was causing backups to take longer. In a number of cases, the site owners did not need all these files, and so we were able to eliminate over 1,000,000 files from the file system, which helped a little.
  • Customers with complex scripts set up to run every 60 seconds -- if these scripts moved files or otherwise manipulated the file system, we coordinated scheduling changes to mitigate the effect. This helped a little.
  • Customers with daily backup and data-import scripts that ran at the same time as our global system backups, or during peak utilization periods. We coordinated scheduling changes with these customers to run these daily scripts in the evening, after traffic loads have subsided but prior to midnight when our backups begin. This helped a little.
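If you're wondering whether your own site falls into the first category, a file count is easy to get. The following is a minimal Python sketch, not a tool we provide; the function name is made up, and the threshold is simply the 250,000 figure mentioned above.

```python
import os
import tempfile

FILE_COUNT_WARNING = 250_000  # the per-site figure mentioned above

def count_files(root):
    """Count regular directory entries (files) under root, recursively."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

# Tiny self-contained demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as site_root:
    os.makedirs(os.path.join(site_root, "cache", "old"))
    for relative in ("index.php", "cache/a.tmp", "cache/old/b.tmp"):
        with open(os.path.join(site_root, relative), "w") as handle:
            handle.write("x")
    demo_count = count_files(site_root)

print(demo_count, demo_count > FILE_COUNT_WARNING)  # 3 False
```

Keep in mind that walking a large tree is itself a burst of filesystem activity, so a check like this is best run once, off-peak, rather than on a schedule.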

Additionally, we made some server configuration changes which proved fruitful. While system loads remained high, the effect of the high load on hosting performance was reduced. So, performance has been much better but still not where we want it to be.

The good news is that we are making excellent progress on our storage architecture re-engineering project. We expect this change to have a major impact on performance and scalability of the Modwest shared system. We'll keep you posted as we have more to announce.

-JM

P.S. In the meantime, a note for technical site owners: Given our current situation, minimizing your site's reliance on the file system could help performance. One way to do this is to move your PHP sessions from files in the /tmp directory to your database. We just launched a new FAQ on this topic.
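To give a feel for the idea, here is a minimal sketch of a database-backed session store. It is written in Python with an in-memory SQLite database purely for illustration and is not the code from the FAQ; the table, column, and class names are made up. The methods loosely mirror the read, write, destroy, and gc callbacks that PHP's session_set_save_handler() expects.

```python
import sqlite3
import time

class DbSessionStore:
    """Keep session data in a database table instead of one file per session."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "id TEXT PRIMARY KEY, data TEXT NOT NULL, updated_at REAL NOT NULL)"
        )

    def read(self, sid):
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (sid,)
        ).fetchone()
        return row[0] if row else ""

    def write(self, sid, data, now=None):
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)",
            (sid, data, now),
        )
        self.conn.commit()

    def destroy(self, sid):
        self.conn.execute("DELETE FROM sessions WHERE id = ?", (sid,))
        self.conn.commit()

    def gc(self, max_age_seconds, now=None):
        """Delete sessions not written within max_age_seconds; return how many."""
        now = time.time() if now is None else now
        cursor = self.conn.execute(
            "DELETE FROM sessions WHERE updated_at < ?", (now - max_age_seconds,)
        )
        self.conn.commit()
        return cursor.rowcount


store = DbSessionStore(sqlite3.connect(":memory:"))
store.write("abc123", "cart=3")
print(store.read("abc123"))  # cart=3
```

Every session read and write becomes a database query rather than a file operation on shared storage, which is the whole point of the exercise.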

July 27, 2007

Happy Sysadmin Day

July 27th, 2007 is System Administrator Appreciation Day.

For the last several years, our most important systems administration work has been performed by Mike, our Operations Manager, and Tom, our Systems Engineer. Tom and Mike are responsible for big-picture infrastructure planning, setting up new servers, workstations, and services, and being generally responsible for the continuous monitoring, maintenance, and improvement of our technical infrastructure. On top of that, these guys get paged at absolutely any time of the day or night, 365 days a year, about:

  • hard drive failures
  • spam outbreaks
  • hacker attacks
  • servers running out of disk space
  • network cards going bad
  • datacenter temperature alarms
  • bandwidth provider interruptions
  • software failures
  • etc etc

Despite everything that can (and unavoidably does) go wrong, I'm proud of the highly reliable hosting we provide, which is largely due to the efforts of our system administration/Operations team.

-JM

 

July 20, 2007

End of auto-forwarding to AOL

We monitor just about everything at Modwest, and one of the types of alerts we occasionally receive is that one or more of the servers in our mail cluster has a backlog of messages awaiting delivery to remote destinations. That happened this week, and upon investigation we discovered just the latest occurrence of a phenomenon with which we're well acquainted by now: due to spam complaints, AOL was blocking all email from all Modwest customers hosted on our shared system.

How could this happen? Surely Modwest has strong anti-spam policies! (Yes, we do.)

Well, there is a small loophole: We allow our customers to set up mail forwarding, which means you can have mail addressed to yourdomain.com (hosted by Modwest) automatically forwarded to another email address elsewhere (such as AOL).  Sometimes, messages that get forwarded to a customer's AOL account are then reported via the "This is Spam" button at AOL; this results in Modwest being identified as the source of that message. Enough complaints, and AOL blocks Modwest altogether.

Therefore, beginning July 27th, 2007, we are going to reconfigure all forwarding rules destined for AOL addresses to deposit received messages in the main account mailbox. Further, we will be disallowing all new forwarding rules with an AOL address as the destination. We'll notify Modwest customers who currently have AOL forwarding rules soon.

We hate to make any changes that restrict features, but due to AOL's policies, continuing to allow these types of forwarding rules can adversely affect every customer of ours.  If you have any questions, don't hesitate to ask.

-JM

May 30, 2007

OnSite Upgrade: MySQL 5 now available

We recently launched a new version of OnSite with one big new feature: MySQL 5.0 is now optionally available for your database hosting needs. Thanks to Software Engineer Jacob and the rest of the team for making it happen.

The new feature allows Modwest shared hosting account administrators to create and manage databases running either MySQL 4.1 or MySQL 5.0, and you can even mix-and-match versions with multiple databases. If you're looking to 'upgrade' your database version, that'll require an export and import, a process now explained in our Knowledge Base.

As always, use the 'Report Bug' feature if you encounter any unexpected behavior or errors. Thanks!

May 18, 2007

Web Hosting Performance Anxiety

We have pretty high standards for web hosting around here, and I gotta say, performance on our shared hosting system has really stunk over the last few weeks, particularly on weekday mornings. (Our managed servers and VPS accounts are running fine.) This is especially bad because this is where most of our customers host their sites and when their visitor traffic is most important; we've been getting a lot of feedback about the problem.

One of our first clues that something was going wrong was the (recently described) misbehavior of MIVA stores.

Additionally, backups started taking longer to complete; backing up 1,079,139,162,282 bytes in bazillions of files takes a long time, and since the backup process itself is relatively resource intensive, it cannot be allowed to run once traffic starts picking up around 7-8AM.  But even with the manual termination of backups, we continue to experience intermittent performance slowdowns at other times during the business day.

Our server monitoring software (Nagios), which checks on the health of everything in the datacenter every couple minutes, has even detected connection timeouts from the main web hosting cluster. That means that at least one of the servers in the cluster was unable to accept a connection, let alone deliver a simple PHP page after 10 seconds -- from inside our network. Not good.

Additionally, most of us use the lori extension for Firefox, which tells you how long it takes for your browser's request to get the very first byte of information back, and then how long it takes to render all the HTML and load all the images. If the network and the server are working well, you'd expect that first byte to get to the browser in well under one second. When we're experiencing performance troubles (as we have several times last week and this week), we've seen the first byte delayed over 10 seconds, again, from inside our network. Not good!
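If you'd like to take a similar measurement without a browser extension, here is a minimal Python sketch using only the standard library. It is an approximation, not the way lori or Nagios does it: the "first byte" time is taken when the status line and headers have been read, and it times only the one HTML response, not images or rendering. The function name is made up; the 10-second default timeout echoes the figure above.

```python
import http.client
import time

def time_request(host, port=80, path="/", timeout=10.0):
    """Time a single HTTP GET; returns status, first-byte time, and total time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", path)
        # getresponse() returns once the status line and headers have arrived,
        # which we use as a stand-in for "first byte received".
        response = conn.getresponse()
        first_byte = time.monotonic() - start
        response.read()  # drain the body to time the full response
        total = time.monotonic() - start
        return {
            "status": response.status,
            "first_byte_s": first_byte,
            "total_s": total,
        }
    finally:
        conn.close()
```

Calling `time_request("www.example.com")` a few times at different hours gives a crude but useful picture; a timeout raises an exception rather than returning a result.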

All of these phenomena stem from an increasing load on our central file server, as illustrated in the trend apparent in the graph below:

[Graph: Nfsload_3]

So file server load has increased rather dramatically in the last month or so, coinciding with MIVA problems, late-finishing backups, and generalized shared hosting performance issues. But is that the cause or the effect?

A theory we're pursuing is that one or a small handful of sites have begun exhibiting outside-the-norm behavior in terms of reads and writes to the file system. Unfortunately, great tools don't exist (at least as far as we can find) for this sort of NFS traffic analysis, so it's been slow going.

There's a long term solution that we've begun working on, but we need to identify the source of our trouble and restore our standard speedy service as soon as we can.

So, if you're hosting on the Modwest shared system and you think you might have recently started doing a whole lot of filesystem I/O, let's talk. Examples would be CMS/blogging software (such as Movable Type) which can periodically (or far too often) generate thousands of cache files, or an MLS system which frequently downloads and processes thousands of images, or any software which requires (or leaves behind) thousands of files.

In summary -- we know about the problem, we want to fix it, and we're doing everything we can. If you have advice or questions, please add a comment below or shoot us an email.


