Sunday, March 18, 2012

Phresheez Has a Yard Sale

Kaboom

I wouldn't have said that Phresheez had an ironclad disaster recovery plan, but at least we had a plan. We do mysql database replication back to my server on mtcc.com and take daily backups of the entire database. We have backups of the server setup, and more importantly a step-by-step build-the-server-from-the-distro document, which is checked into svn along with the config files. We only have one active production server, so that implicitly accepts that significant downtime is possible. On the other hand, we have been running non-stop since 2009 with exactly one glitch, where a misconfigured Apache ate all its VM and wedged. That lasted about an hour or so -- hey, we were skiing at the time, so all in all not terrible for a shoestring budget.
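For the curious, a daily backup taken off a replication slave boils down to something like the sketch below. The database name, paths, and details here are stand-ins for illustration, not necessarily exactly what we run:

    #!/bin/sh
    # Nightly dump taken on the slave so the master never has to go down.
    # Database name and backup path are illustrative.
    DB=phresheez
    DEST=/backups/mysql

    mysql -e "STOP SLAVE SQL_THREAD;"     # freeze the slave at a consistent point
    mysqldump --databases $DB | gzip > "$DEST/$DB-$(date +%F).sql.gz"
    mysql -e "START SLAVE SQL_THREAD;"    # let replication catch back up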

Our downtime doesn't take into account routine maintenance, and I had been needing to do a schema update on our largest table, the GPS point database. So when I happened to wake up at 2am, I decided to use the opportunity to make the change. Nothing complicated -- just take the site offline and make a duplicate table with the new schema. That's when the fun began. Each of the several times I tried, the master side gave up, complaining that something was wrong with the old table. I then tried a repair table on it, and that bombed too. Strange. After the fact -- a mistake, though as it turns out it didn't matter -- I decided to do a file system copy of the database table file. Death. Dmesg was definitely not happy about the disk either. I tried to see if the index file was ok: same problem. I tried other large tables, and they seemed fine. A mysql utility confirmed that it was just the GPS point database, which was bad enough all by itself.
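For the record, the maintenance itself was nothing exotic. The general shape of it, and of the checks that followed when things went sideways, was roughly this -- the table and column names are stand-ins, since the actual schema change isn't the interesting part:

    # Stand-in table/column names; the real schema change isn't the point.
    mysql phresheez -e "CREATE TABLE points_new LIKE points;"
    mysql phresheez -e "ALTER TABLE points_new ADD COLUMN some_new_col INT;"
    mysql phresheez -e "INSERT INTO points_new SELECT * FROM points;"  # died here, every time
    mysql phresheez -e "CHECK TABLE points;"    # not happy about the old table
    mysql phresheez -e "REPAIR TABLE points;"   # bombed too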

So... I was pretty much hosed. Something had blown a hole in the file system and torched my biggest table -- some 15 Gig big. Fsck, with some prodding, found and repaired the file system damage, but couldn't salvage the files themselves. So it's restore-from-backup time. Ugh.

There were two options at this point: do a backup of the slave or just copy over the slave's data file. I wasn't entirely coherent (it was early), and decided to give the first a try. Here's where the first gigantic hole in our strategy came in: either method required that a huge file be copied from the slave server to the master. Except the slave is a machine on a home DSL uplink getting about 100 KB/sec throughput: scp was saying about 12 hours transfer time. Oops.
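Concretely, the two options look something like the sketch below, assuming MyISAM tables (which is why an index rebuild shows up later). The table, database, hostname, and paths are all placeholders; the point is that either way, gigabytes have to crawl across a 100 KB/sec link:

    # Option 1: dump the table from the slave and ship it over (what I tried first).
    mysqldump phresheez points | gzip > points.sql.gz
    scp points.sql.gz master.example.com:/tmp/

    # Option 2: pause the slave and copy the raw data file directly,
    # then rebuild the index on the master end.
    mysql -e "STOP SLAVE;"
    scp /var/lib/mysql/phresheez/points.MYD master.example.com:/var/lib/mysql/phresheez/
    mysql -e "START SLAVE;"
    # ...and on the master afterwards:
    #   mysql phresheez -e "REPAIR TABLE points USE_FRM;"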

The long and short of it is that after about 12 hours the file was copied over, the index was regenerated, and Phresheez was up again, no worse for wear as far as I can tell. A very long day for me, and a bunch of unhappy Phresheez users.

Post Mortem

So here's what I learned from all of this.

  • First and foremost, the speed of recovery was completely dependent on the speed of copying a backup to the production server. This needs to be dealt with somehow. The cheapest fix might be copying the backups to USB flash and finding somebody with a fast upstream connection who can copy them to the production machine. Better would be to spend more money per month and put the slave on its own server in the cloud. But that costs money.
  • Large tables are not so good. I've heard this over and over, and have been uncomfortable about the GPS point table's size (~300M rows), but I had been thinking about it more from a performance standpoint than a disaster standpoint. I've had a plan to shard that table, but wasn't planning on doing anything until the summer low season. However, since the downtime was purely a function of the size of the damaged table, this is really worth doing -- there's a rough sketch of what it could look like right after this list.
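To be clear, the sharding I have in mind is nothing fancy. One obvious way to do it (not necessarily the way I'll end up doing it) is to split the point table by month, so that any one damaged table is a small slice of the data rather than the whole 15 Gig. Again, names are stand-ins:

    # One possible scheme (illustrative): one point table per month.
    mysql phresheez -e "CREATE TABLE points_201203 LIKE points;"
    mysql phresheez -e "CREATE TABLE points_201204 LIKE points;"
    # ...and the app picks the table name (points_YYYYMM) from each point's
    # timestamp, both when inserting and when querying a trip.

A corrupted or lost shard would then take minutes to re-copy instead of half a day.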

Disaster on the Cheap

The long and short of this is that when you have single points of failure, you get single points of failure. Duh. The real question is how to finesse this on the cheap. The first thing is that being able to copy the backup over the net quickly would have cut the downtime by about an order of magnitude in this case. Sharding would have also cut the downtime significantly, and for that table it really needs to be done anyway.

However, this is really just nibbling at the edges of what a "real" system should be. Had the disk been completely cratered, it would have required a complete rebuild of the server and its contents, and that would still have taken hours, though maybe not the 12 hours of downtime we suffered. Throwing some money at the problem could significantly reduce the downtime, though. Moving the replication to another server in the cloud instead of home DSL would help quite a bit, because the net copy would take minutes at most.

A better solution would be to set up two identical systems where you can switch the slave to being the master on a moment's notice. The nominal cost is 2x-3x or more, because storing the daily backups isn't free -- disk space costs money on servers. The slave could be scaled down on CPU/RAM, but that only reduces the cost to a point. Another strategy would be to keep the current arrangement -- replicate to cheap storage at home and keep the long term backups there -- but add a second live replica on another server in the cloud. The advantage is that a meltdown on the master most likely doesn't affect the slave (as was the case above), so a quick shutdown of the cloud slave to take a backup, or switching it over to be the master, would lead to much better uptime. Keeping the long term backups on mtcc.com just becomes the third leg of triple redundancy, reserved for complete nightmare scenarios.
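Mechanically, the switch itself isn't much work; promoting a slave amounts to something like the sketch below. The hostname is a placeholder, and the real pain is in repointing the app (a config change or DNS flip), not in the mysql commands:

    # On the cloud slave, when the master melts down:
    mysql -e "STOP SLAVE IO_THREAD;"      # stop trying to pull from the dead master
    mysql -e "SHOW SLAVE STATUS\G"        # wait until the relay log is fully applied
    mysql -e "STOP SLAVE; RESET MASTER;"  # promote: drop old binlogs, act as a fresh master

    # Then point the app at the new master, e.g. db_host = slave.example.com,
    # and set up replication in the other direction once the old box is rebuilt.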

Is it worth it? I'm not sure. It may be that just getting a fast way to upload backups is acceptable at this point. One thing to be said is that introducing complexity makes the system more prone to errors, and even catastrophic ones. I use replication because it would be unacceptable to have two hours of nightly downtime to do backups. However, mysql replication is, shall we say, sort of brittle, and it still makes me nervous. Likewise, adding a bunch of automated complexity to the system increases the chances of a giant clusterfuck at the worst possible time. So I'm cautious for now: what's the smallest and safest thing I can do to get my post-disaster uptime into an acceptable zone? For now, that's finding a way to get backups onto that server pronto, and I'll think hard about the other costs and complexities before I rush headlong into anything bigger.

So What Happened to the File System Anyway?

At some level, shit just happens. It's not whether something will fail, but when it will fail and how quickly you can recover. But this is the second time in about a month that I've had problems that required fsck to come to the rescue. My provider recently moved me to a new SAN because the previous one was oversubscribed. Did something happen in the xfer? Or is their SAN gear buggy and corrupting data? I dunno. All I know is that I haven't had any reboots of any kind since I moved to the new SAN, so there shouldn't have been a problem unless it dates back to before the SAN move. I sort of doubt that the underlying Linux file system is the cause -- there's so much mileage on that software that I'd be surprised. But after fighting with my provider about horrible performance with their SANs (100 KB/sec transfers for days on end), and now this... I'm thinking very seriously about my options.