The data center my web server resides in moved recently; I - and everyone I share my rack with - had to move our equipment some time between the beginning and end of this month. So that happened today, and I'm sorry for the downtime (to anyone that noticed) but it's stuff like this that, honestly, keeps me wanting to run my own server.
I remember something similar happening once at NCsoft; I don't remember (and may never have known) the "why," but data centers changed. I only noticed because... wow, all those sysadmins, network admins, and other operations staff sure seemed tired and cranky. But as I recall, it didn't actually translate to significant game downtime - certainly not the 8 or so hours this site had today (to be clear, everything in the rack was being moved at once - I didn't do such a terrible job that it took me 8 hours to move one 2U system across town).
That's something I should keep in mind. Can the systems I develop be easily migrated if they have to be? Not just in the (generally rare) case of an entire data center moving or contracts changing, but also in the day-to-day case of machine failure.
At nearly midnight, having been home for about 15 minutes, I'm certainly not of the right mindset to make a list of proper steps or describe an architecture or any such thing; but I can at least try to remember this feeling later, because I don't really wish it on anyone. :-)


One interesting bit of advice I heard for server design was to not include any procedure for cleanly shutting down a server. You need to expect and handle complete dropouts anyways, so to ensure you do a good job of it, you assume that's the only way a server will ever leave service.
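To make that concrete, here is a minimal sketch of what "no clean shutdown" might look like for a tiny key/value store. Everything in it (the `CrashOnlyStore` name, the JSON-lines log format, the file layout) is an assumption made up for illustration, not something from a real project. The idea is that every write is appended and fsynced before it counts, and startup always runs the same recovery path, which throws away a half-written last record.

```python
import json
import os


class CrashOnlyStore:
    """Hypothetical key/value store with no shutdown procedure at all.

    State lives in an append-only log of JSON lines. Startup is always
    "recover from the log", whether the last exit was a crash or not.
    """

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()
        # Append in binary mode so offsets match what recovery measured.
        self._log = open(self.path, "ab")

    def _recover(self):
        """Replay complete records and cut off a torn final record."""
        if not os.path.exists(self.path):
            return
        good_end = 0
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # the process died mid-write
                try:
                    record = json.loads(raw)
                except ValueError:
                    break  # corrupt record: stop replaying here
                self.data[record["key"]] = record["value"]
                good_end += len(raw)
        # Drop anything after the last good record so that new appends
        # do not get glued onto a partial line.
        with open(self.path, "r+b") as f:
            f.truncate(good_end)

    def put(self, key, value):
        """Make the write durable first, then update the in-memory view."""
        line = json.dumps({"key": key, "value": value}) + "\n"
        self._log.write(line.encode("utf-8"))
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value
```

There is deliberately no `close()` method: killing the process at any point should leave a log that the next startup can read. A real system would also need log compaction and some thought about what fsync actually guarantees on the hardware in question.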
The book Release It! by Michael Nygard also has a lot of interesting ideas on how to build reliable services.
Also, thanks so much for making my pile of books to read larger. I needed that. ;-)
Instead, design your application for failure. As the previous poster suggests, build the app to "crash early" and crash safely. Loosely couple systems and isolate code into small services that do not share state. Don't rely on your database to ensure consistency, and avoid putting business logic in the database. Seriously rethink using a single database as your primary authoritative datastore. Automate testing, and spend some time thinking about truly horrible modes of failure (total power loss, database corruption, DDoS, rm -rf *, etc.).
Flickr has a policy of requiring developers to spend about 30% of their time doing operations work, and operations people to spend time doing development. Operations and development staff work on the same team, use the same deployment tools and source control repositories, and report to the same manager. Design responsibility is shared between the groups. It seems like a pretty good idea. I've always thought the high wall between development and operations is a serious problem, leading to inefficiency and, often, downright hostility. (Does development test anything first? If operations would let me attach a debugger to the process or just log in to the damn server I could have solved this already, Argh! ...etc.)
Mark
As for small systems that don't share much state... we did a fair bit of that on Dungeon Runners: auth, part of our chat services, guilds, the database and database cache, etc. were individual services that are used if they're online and worked around if they're offline. It was kind of a mixed bag. Not crashing because those services are down is big, but at the same time it's easy to share too little state and build a bad user experience.
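For anyone wondering what "used if they're online and worked around if they're offline" can look like in code, here is a rough sketch. It is not the Dungeon Runners implementation; `OptionalService`, `ServiceUnavailable`, and the `retry_after` cooldown are all names and choices invented for this example. The wrapper calls the real service when it can, and after a failure it serves a fallback for a while instead of hammering a service that is known to be down.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) client when its service is unreachable."""


class OptionalService:
    """Wrap a call to a non-essential service with a fallback.

    After a failure, skip the real call for `retry_after` seconds and
    answer from the fallback instead.
    """

    def __init__(self, call, fallback, retry_after=30.0, clock=time.monotonic):
        self._call = call
        self._fallback = fallback
        self._retry_after = retry_after
        self._clock = clock  # injectable so tests need not sleep
        self._down_until = 0.0

    def __call__(self, *args):
        if self._clock() < self._down_until:
            # Still inside the cooldown window: do not touch the service.
            return self._fallback(*args)
        try:
            return self._call(*args)
        except ServiceUnavailable:
            self._down_until = self._clock() + self._retry_after
            return self._fallback(*args)
```

Usage might be something like `guild_of = OptionalService(guild_client.lookup, lambda character_id: None)`, where `guild_client` is a stand-in for whatever client the game server already has. The hard part is not the wrapper but deciding what the fallback should be, since "no guild" and "guild service is down" probably deserve different treatment in the UI.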
For example, it was a long battle getting character renames, deletes, and delete/recreate-with-same-name to work properly across all the services that needed it. It's easy to let them get out of sync, but it's hard to get the distributed transactions and synchronization right.
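One common way to make this kind of thing less painful, sketched below, is to stop aiming for a distributed transaction and instead have each service apply versioned updates idempotently, so that duplicated or out-of-order messages still converge on the same answer. This is a swapped-in technique for illustration, not what we actually shipped; `NameDirectory` and the per-character version counter are assumptions, and it presumes a single authoritative service hands out the version numbers.

```python
class NameDirectory:
    """One service's local view of character names (hypothetical).

    Each update carries a version that increases per character. A name
    of None is a tombstone, meaning the character was deleted.
    """

    def __init__(self):
        self._records = {}  # character_id -> (version, name or None)

    def apply(self, character_id, version, name):
        """Apply an update unless an equal or newer one was already seen.

        Returns True if local state changed, False if the update was a
        duplicate or arrived late.
        """
        current = self._records.get(character_id)
        if current is not None and current[0] >= version:
            return False
        self._records[character_id] = (version, name)
        return True

    def name_of(self, character_id):
        """Return the current name, or None if unknown or deleted."""
        record = self._records.get(character_id)
        return record[1] if record else None
```

With this shape, delete-then-recreate-with-the-same-name becomes a tombstone on the old character id plus a fresh record under a new id, so a late rename for the old character cannot clobber the new one. It does nothing about name uniqueness across characters, which still needs a single owner somewhere.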
You also run into problems with game rules that rely on those external services... for a children's game, for example, you might need to integrate chat with whatever service manages friends, as well as groups or guilds, so that you can only chat with people you "know." It seems you're left trying to replicate and cache state held by other services quite a bit, and that leads to more opportunities to get the world in an inconsistent state.