Hmm, on your Google point, we know that they use partial-cluster deployments ext...

mechanical_fish · on Feb 10, 2009

I agree that we shouldn't extrapolate too much from this one incident. But it's not like Google's super-secrecy policy gives us much choice. If anyone from Google wants to tell us about their deployment infrastructure and explain why this one incident really was a nigh-impossible black-swan one-in-one-billion-hour freak of nature -- or why Google has sensibly traded away a certain amount of uptime in exchange for a more flexible architecture (or, perhaps, more cash to spend on tasty gourmet pizzas) -- I'm sure we'll all listen with rapt attention. Until then, we get to tease them mercilessly. ;)

Meanwhile, I'm sure that the original submitter would agree that tests ain't perfect. If you read the link at the top of this blog post:

http://timothyfitz.wordpress.com/2009/02/08/continuous-deplo...

...you'll find that this isn't merely an article about automated testing. Automated testing is just a part of the mighty continuous-deployment ecosystem being described here. It isn't even the real heart of that system: The heart is a planned, well-designed, semi-automated routine for rolling back changes in production. They roll out a change to a subset of their servers, monitor for statistical anomalies in the usage patterns of real, live users, and only continue the rollout if there are no anomalies. If they run into trouble, back they go.
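For anyone who wants that loop spelled out, here is a minimal sketch in Python. It is not the code from the linked post. The stage fractions, the three-sigma anomaly check, and every name (`staged_rollout`, `deploy`, `rollback`, `read_metric`) are assumptions made up for illustration.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence


def is_anomalous(baseline: Sequence[float], observed: float,
                 threshold: float = 3.0) -> bool:
    """Flag a metric more than `threshold` standard deviations from the
    baseline mean. (Assumed check; a real system would use richer stats.)"""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


def staged_rollout(servers: List[str],
                   deploy: Callable[[str], None],
                   rollback: Callable[[str], None],
                   read_metric: Callable[[List[str]], float],
                   baseline: Sequence[float],
                   stages: Sequence[float] = (0.1, 0.5, 1.0)) -> bool:
    """Deploy to growing subsets of servers; undo everything on an anomaly.

    Returns True if the change reached every stage, False if it was
    rolled back. All callables are hypothetical hooks supplied by the caller.
    """
    updated: List[str] = []
    for fraction in stages:
        # Number of servers that should be on the new version after this stage.
        target = max(1, round(len(servers) * fraction))
        for server in servers[len(updated):target]:
            deploy(server)
            updated.append(server)
        # Watch real traffic on the updated subset before going further.
        if is_anomalous(baseline, read_metric(updated)):
            for server in reversed(updated):
                rollback(server)
            return False
    return True
```

The point of the design is that the rollback path is part of the normal routine instead of an emergency procedure: a failed stage costs only the servers updated so far.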

jeremyw · on Feb 10, 2009

I don't want to defend Google per se, but their uptime results speak for themselves. I don't see how a rare bug necessitates mocking.

And I agree about the resiliency of the deploy -- it's what I meant by sophisticated testing of these momentary guinea pig users. Google's presentations on this stuff are about analysis and data gathering of changes, both for immediate functional snafus and for user preference for changes, i.e. they're probably state of the art in this regard.