I touched a little bit on continuous deployment in talking about Craftsmanship, but got sidetracked into a discussion of why downtime matters even if it "doesn't matter." There are two big technical hurdles to continuous deployment, in my opinion. The first is having sufficient automated test coverage that you're confident, every time you publish, that the bugs are in your design and not in the rest of your code. Let's ignore that one for a moment, because the second hurdle - constantly bringing servers down to update them, disrupting your service - seems to have further caught the fancy of one Bryant Durrell, former director of operations at Turbine.
He doesn't like the idea of continuous deployment, but he loves the idea of minimizing downtime to nothing (who doesn't?). On the other hand, I think the process he describes could use some work. There are a few things that really jump out at me, ways to minimize stress and downtime without even pushing that hard. The first is to get rid of the deployment checklist, as quickly as you can. Replace that checklist with a button to push. If you miss doing some things when the button is pushed, if you discover corner cases the automated publish doesn't handle, you'll notice - and you'll notice your mistakes faster when your expectation is "push button, see publish" than when you are constantly having to re-evaluate in your head where you are in the process, what's left, and at what point you see results.
The other issue I see in Bryant's posts is the discussion of rollbacks. I'm not sure there's a single thing more heinous, from a player's perspective, than a rollback. It's a terrible error when a bug causes a crash before players' data can be saved; I don't think we should be planning, as developers, to instigate extra rollbacks. I don't have any specific advice that comes to mind here, but I think that if you're resolute in avoiding rollbacks, a way to get around having to roll back after a bad publish will open up. Sorry that's not very helpful. :)
On to the meat of minimizing downtime... I think the biggest issue is really just a question of load balancing. All kinds of other networked systems are designed to cluster in such a way that individual machines can go down without the entirety of the service being affected.
Game servers have more state than most, but on the other hand we're talking about planned downtimes here, not disaster recovery (although that's worth thinking about too). If you design the entire system around short-lived processes on a cluster, for example, then it should be possible to label an individual machine as being no longer in the pool of available machines while the process updates. Of course, there you probably run into an issue of data storage, but you can design around that as well - make it easy for an older version of the process to ignore new information in the serialized version of the player character, for example.
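As a rough sketch of that last idea (an older process ignoring fields it doesn't know about), here is one way it might look in Python. The `PlayerCharacter` fields, the JSON format, and the `load_character` helper are all made up for illustration; a real game would have its own serialization.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PlayerCharacter:
    # The fields this (older) build of the server knows about. Hypothetical.
    name: str
    level: int = 1
    gold: int = 0


def load_character(raw: str) -> PlayerCharacter:
    """Build a character from saved JSON, skipping any keys this build doesn't know."""
    data = json.loads(raw)
    known = {f.name for f in fields(PlayerCharacter)}
    return PlayerCharacter(**{k: v for k, v in data.items() if k in known})


# A newer build saved an extra "mount" field; the older build still loads the record.
saved_by_new_build = json.dumps(
    {"name": "Aria", "level": 12, "gold": 340, "mount": "gryphon"}
)
character = load_character(saved_by_new_build)
```

One thing this sketch glosses over: if the older process saves the character again, the new field is lost unless the unknown keys are carried along and written back out.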
If you have longer-lived processes, a stronger hand-off procedure would be needed. However, given gigabit interconnects, it seems reasonable that a server about to be updated could contact a hot spare, convey its current state, and transfer data ownership in a very brief span of time. Once a server has relinquished control of any player state, it could be safely updated and restarted, ready to serve updated clients.
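The hand-off described above might look something like the following, as a toy in-process model only. The `GameServer` class and its `hand_off_to` method are hypothetical; in practice the two servers would be separate machines and the state would cross the network, with all the failure handling that implies.

```python
from dataclasses import dataclass, field


@dataclass
class GameServer:
    # Hypothetical stand-in for a long-lived server process.
    name: str
    players: dict = field(default_factory=dict)  # player id -> player state
    accepting: bool = True

    def hand_off_to(self, spare: "GameServer") -> int:
        """Give every player's state to the spare, then relinquish ownership."""
        self.accepting = False           # stop taking new players first
        snapshot = dict(self.players)    # convey current state
        spare.players.update(snapshot)   # the spare now owns these players
        self.players.clear()             # this server holds no player state
        return len(snapshot)


old = GameServer("zone-1", {"p1": {"hp": 10}, "p2": {"hp": 7}})
spare = GameServer("hot-spare")
moved = old.hand_off_to(spare)
# old is now empty and can be updated and restarted
```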
Another option seems to be faking "zero downtime" - go ahead and force everyone to restart their client and reconnect when the server they're on is updated, but do it as a rolling update (so the majority of your playerbase can continue to be online at any given moment), and make it as easy as "I'm disconnected, now I reconnect and the game is updated."
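A rolling update of this kind could be sketched as below. The server records are plain dictionaries and the `rolling_update` function is hypothetical; the point is only that at most one batch of servers is offline at any moment.

```python
def rolling_update(servers, new_version, batch_size=1):
    """Update servers one batch at a time; return the most servers offline at once."""
    max_offline = 0
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            server["online"] = False  # players on these servers get disconnected
        offline_now = sum(1 for s in servers if not s["online"])
        max_offline = max(max_offline, offline_now)
        for server in batch:
            server["version"] = new_version
            server["online"] = True   # players reconnect to the updated server
    return max_offline


servers = [{"name": f"world-{i}", "version": 1, "online": True} for i in range(4)]
worst = rolling_update(servers, new_version=2, batch_size=1)
```

With four servers and a batch size of one, three quarters of the servers stay up throughout the publish.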
Tuesday, April 21. 2009
LOGIN Conference
I've been dragging my feet on getting the necessary approvals to go to LOGIN, and now I'm kind of glad: I don't have the exact date yet but I'll be attending a memorial service for my uncle around that time instead. It might end up being the week before LOGIN, but I don't think that makes a second week away from home particularly more appealing.
Maybe next year... and I look forward to seeing a lot of the same people (and possibly some of the same presentations!) here in Austin in the fall.
Monday, April 20. 2009
Blog migration progress
I'm happy with how comments are handled now. HTML is allowed, but HTML tags and attributes are whitelisted; if I see problems I'll restrict it further, but this lets comments I imported from LiveJournal display reasonably, and is the least surprising kind of markup I could use. I had to write my own plugin (leaning heavily on a pre-existing HTML parser, HTML Purifier: I'm not stupid enough to try my own hand at parsing HTML); if it continues to work out pretty well for me I'll submit it to the s9y repository. Of course, I'm not exactly expecting all the WordPress-loving bloginati (who may very well already be using an HTML Purifier plugin!) to care.
Other than that, work continues with nothing I can discuss. :-)
Tuesday, April 14. 2009
Humor me
I'm still messing around with comments here. If anyone would play around with the comment box and give me feedback about it, I'd appreciate it. Have a preferred markup system (BBcode, Wiki, Markdown, HTML, etc.)? Let me know.
Sunday, April 12. 2009
New Blog!
Go ahead and leave feedback about how much you hate the new site, hated the old site, hate me, etc. here. I look forward to it. :-)