Don't be too disappointed if a single submission gets a lukewarm or confused response on HN. The upmods and comments on here are a lot less consistent than what you're used to. ;) Just keep writing. It's really valuable.
Also, it's clear to me why your daily routine might sound like science fiction to the median HN reader: A lot of programmers have never seen a system like this. As those of us who were online during a specific half-hour period a couple weeks ago can attest, even Google doesn't have a system that's remotely as reliable as this: It appears to be possible to break all of Google search, worldwide, in ten minutes by misplacing a single character in a text file.
I have often posed questions here about things I've been doing for years just to see what others are doing or if there is a better way. Invariably, someone tells me it won't work when I already know better.
This tells me two things: I've encountered someone who speaks when they should be listening (what else is new), and, more importantly, I'm pushing the envelope enough to make otherwise knowledgeable people uncomfortable. Good.
Hmm, on your Google point, we know that they use partial-cluster deployments extensively, and several presentations point to sophisticated testing of these momentary guinea pig users. I wouldn't hold a one-time lack of a sanity check against their total uptime history. Tests ain't perfect.
I agree that we shouldn't extrapolate too much from this one incident. But it's not like Google's super-secrecy policy gives us much choice. If anyone from Google wants to tell us about their deployment infrastructure and explain why this one incident really was a nigh-impossible black-swan one-in-one-billion-hour freak of nature -- or why Google has sensibly traded away a certain amount of uptime in exchange for a more flexible architecture (or, perhaps, more cash to spend on tasty gourmet pizzas) -- I'm sure we'll all listen with rapt attention. Until then, we get to tease them mercilessly. ;)
Meanwhile, I'm sure that the original submitter would agree that tests ain't perfect. If you read the link at the top of this blog post:
...you'll find that this isn't merely an article about automated testing. Automated testing is just a part of the mighty continuous-deployment ecosystem being described here. It isn't even the real heart of that system: The heart is a planned, well-designed, semi-automated routine for rolling back changes in production. They roll out a change to a subset of their servers, monitor for statistical anomalies in the usage patterns of real, live users, and only continue the rollout if there are no anomalies. If they run into trouble, back they go.
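Under stated assumptions, that rollout loop might be sketched like this (the names `deploy_to`, `metrics_anomalous`, and `roll_back` are hypothetical stand-ins, not IMVU's actual tooling):

```python
# A sketch of the incremental-rollout loop described above: push a
# change batch by batch, watch live usage metrics, and back out
# everywhere it went if anything looks anomalous.

def incremental_rollout(change, server_batches, metrics_anomalous,
                        deploy_to, roll_back):
    """Deploy a change one batch of servers at a time.
    Returns True on a full rollout, False after a rollback."""
    deployed = []
    for batch in server_batches:
        deploy_to(batch, change)
        deployed.append(batch)
        if metrics_anomalous():          # statistical check on real users
            for b in deployed:
                roll_back(b, change)     # undo everywhere it reached
            return False
    return True
```

The key design point is that rollback is a first-class, planned path through the code, not an emergency procedure invented at 3am.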
I don't want to defend Google per se, but their uptime results speak for themselves. I don't see how a rare bug necessitates mocking.
And I agree about resiliency of the deploy -- it's what I meant by sophisticated testing of these momentary guinea pig users. Google's presentations on this stuff are about analysis and data gathering of changes both for immediate functional snafus and user preference for changes. i.e. probably state of the art in this regard.
> As those of us who were online during a specific half-hour period a couple weeks ago can attest, even Google doesn't have a system that's remotely as reliable as this: It appears to be possible to break all of Google search, worldwide, in ten minutes by misplacing a single character in a text file.
How can we conclude that the same isn't true of IMVU? The fact that such a rare event hasn't happened to them yet tells us very little.
I definitely wasn't disappointed. In fact it was refreshing; I didn't realize how hard of a problem other people considered Continuous Deployment (and in fact, how little other people had considered it at all).
I also didn't mean to rag on news.yc; on average the responses here are better than the original posts, which is an unparalleled level of quality.
I left the technical issues unspecified in my first post on Continuous Deployment, and the comments on that post already had started discussing the path of solutions we ended up building ourselves!
I still have another post to write about this, because we also ship a native Windows client. We ship daily prereleases of it, and roughly biweekly full releases (offered to all users). That's close to, but not quite as impressive as, the update system that Google Chrome uses. Google definitely still has us beat in certain categories.
One of the greatest lines I have ever read on a blog:
It may be hard to imagine writing rock solid one-in-a-million-or-better tests that drive Internet Explorer to click ajax frontend buttons executing backend apache, php, memcache, mysql, java and solr. I am writing this blog post to tell you that not only is it possible, it’s just one part of my day job.
The part you're missing is the 15,000 tests, multiplied by a new commit every 9 minutes, which over 8 working hours is roughly 50 commit-test cycles, so 750,000 test runs in a day's timespan...
Edit: of course that assumes a peak commit rate matching or exceeding the commit-test cycle period. The point being that even a considerably low rate of failure in the testing mechanism could manifest itself as a blocked commit-test-deploy cycle at least once a day, hence the importance placed on rock-solid testing systems that should only ever fail when the tested code itself fails.
We empirically have on average 70 builds a day. The number is higher than your calculation because we don't all work 9-5; we're committing frequently from around 8am to 9pm. We also run builds repeatedly overnight to flush out any intermittently failing tests we may have recently introduced. We'll run the builds as fast as they can go from 2am-4am.
So how often does a commit get checked in that causes a test (or tests) to fail?
It just seemed to me like you were bragging that tests get run over and over again. They only need to get run if any new code is committed, of course.
And what kind of commit is being checked in every 9 minutes? How big is the dev team? Seems like an awful lot of commits. Is each one a full-fledged feature / bug fix for the site, or are many 1-line changes to the code?
As I read it, the issue he's talking about is that it's easy to accidentally write a test that passes the first 100,000 times you run it, but then fails the next time because of a timeout that was set too low or something like that. A test like that can waste a lot of your time tracking down a nonexistent bug.
It's true that any particular test that spuriously fails one in a million times may never fail. But if you have tens of thousands of tests, and you do tens of test runs per day, you'll have a test spuriously fail once a day or so.
When I say reliable, I don’t mean “they can fail once in a thousand test runs.” I mean “they must not fail more often than once in a million test runs.” We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day. Even with a literally one in a million chance of an intermittent failure per test case we would still expect to see an intermittent test failure every day.
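The arithmetic in that quote checks out:

```python
# Sanity-checking the numbers quoted above: 15,000 test cases run
# about 70 times a day is roughly a million test executions, so even
# a literal one-in-a-million flake rate predicts about one spurious
# failure every day.

tests_per_build = 15_000
builds_per_day = 70
flake_rate = 1e-6  # per-execution chance of a spurious failure

executions_per_day = tests_per_build * builds_per_day
expected_flakes_per_day = executions_per_day * flake_rate

print(executions_per_day)                  # 1050000
print(round(expected_flakes_per_day, 2))   # 1.05
```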
He means that test failures are not acceptable in their culture, and they have a LOT of tests.
In the article it talks about tests that fail only one in a million times (referring to the quality of the test, not the test case being exercised), I think that is what is meant.
i am a native speaker and it's not entirely clear to me either, but i would feel safe in venturing to guess that it's a reference to the false-positive rate
These are automated functional tests. It would be interesting to know whether the author uses unit tests too, and how he writes tests in general. Also, some code:test ratio metrics would indicate how much weight all this testing adds.
I work at IMVU as well, know the author personally, and can tell you that yes, he unit tests too. In fact, like all of us, he writes all code test-first.
For test strategy in general, you need a whole seminar. There's a good blog on testing over at http://googletesting.blogspot.com/ <- go there!
As for code:test ratio, I'm not sure what metric you want, but using just php files, there are ~1k active test files and ~7k php files total. So if you take into account that 1k of that 7k is the test, and some is test-setup and obsolete files we no longer use, it's maybe about 5:1 code:test. However, a lot of the code is third party OS software packages. In-house code is about 1.5:1 test:code, in my experience. These, by the way, are very rough estimates.
I tell people that we should aim for this sort of automation and they pat me on the head and say, "No, no, that will never do."
I think there's an idea that if something goes wrong because you let an automated system do it, it's somehow much worse than if something goes wrong because there was human error. I don't really understand the reasoning.
Exactly. Drew Perttula put it better than I'll be able to:
"IMHO, manual testing has only two advantages: it’s the easiest thing to [try to] do; and it has a lovely accountability chain. You can always blame the developer, and non-technical people will easily accept that this is the “inevitable cost of software engineering”."
The thing I'd be really interested in is how you deal with UI changes - I've never found a satisfactory way to test "is this ugly/confusing" other than letting a few users bang on it on a staging server.
The ultimate solution is to have business metrics drive your UI changes, usually in the form of an A/B test. Then you have a clear winner. This A/B would be run separate from the roll out structure (and indeed, we do LOTS of A/B tests).
Sometimes that's not possible, for a new feature or for content without a clear business metric to evaluate for. Either way we often have someone manually test new UI, so that we're not exposing users to something fundamentally broken. We usually do this by using the existing deploy system, but turning the frontend on only for QA users.
In the end, you do what works and is cheap, and that's usually something slightly different for every project.
There's a difference between automation and how often your customers see something going wrong. I'm all for the automation. But let's say the error rate is so low that just 0.1% of automated releases go wrong. Rolling out 50 times a day means you'll expose an error every 20 days. Compare that to a monthly, weekly, or even daily cycle and you can see you're exposing yourself and your customers to problems without much corresponding gain.
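Working through that estimate directly:

```python
# The estimate above: a 0.1% per-release failure rate at 50 releases
# a day produces one customer-visible error roughly every 20 days.

failure_rate = 0.001      # fraction of automated releases that go wrong
releases_per_day = 50

failures_per_day = failure_rate * releases_per_day   # 0.05
days_between_failures = 1 / failures_per_day

print(days_between_failures)   # 20.0
```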
From what I have read, Facebook uses a similar method: commit and deploy often and roll back if something messes up. We also use this method on Plurk.com and have done so for about a year. Though IMVU's case is pretty extreme :)
The major problem is rolling back client-side changes (those located in scripts or CSS). These are pretty costly to roll back because of the browser cache - we solve this by having real versioning of the static files so we can force a refresh of the browser cache (real versioning = script_{timestamp}.js and not script.js?v={timestamp}).
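A minimal sketch of that kind of "real versioning", assuming a simple copy-to-publish build step (the helper name is made up for illustration, not Plurk's actual code):

```python
# Publish a static file under a versioned name so each deploy gets a
# brand-new URL: script.js -> script_1234567890.js. The browser cache
# can never serve a stale copy, because the old name is never reused.

import os
import shutil
import time

def publish_versioned(src_path, out_dir, version=None):
    """Copy src_path into out_dir under a versioned name and return
    that name, which templates then reference directly."""
    version = version or str(int(time.time()))
    base, ext = os.path.splitext(os.path.basename(src_path))
    versioned = f"{base}_{version}{ext}"
    shutil.copy(src_path, os.path.join(out_dir, versioned))
    return versioned
```

The flip side, as the thread notes below, is that every reference to the file has to be regenerated on each deploy, which is why this usually lives in a templating or build step rather than in hand-written HTML.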
I've read your post on unit tests, and I didn't understand what you were trying to say.
Were you saying don't write automated tests that test your code, and instead focus on monitoring the actual production environment?
Or were you saying that specifically the "unit test" class of automated tests are not worth their time?
I can imagine a system that monitors the business metrics well enough to prevent defects from slipping into production (it's a stretch, metrics are soft and squishy moving targets), but I can't imagine using only those metrics to find every bug you ever slip into production. Metrics are so distant from the bug that caused their downturn; you'd waste so many cycles debugging. The gap between writing the code and finding the problem would be much larger than if unit tests found them; that has to slow things down as well.
- Monitoring the production environment, tons of effort. We record and analyze an incredible amount of data about everything that happens on the site, and have more and more automated processes looking for anomalies (though still nowhere near as many as I would like).
- Automated testing not including unit tests, some effort. I wouldn't be opposed to us doing more of this, but it's not incredibly high-priority and there always seems to be something else that's more important.
- Unit testing, yeah, not worth our time as far as I'm concerned.
I've worked at places that have done this sort of thing too. We basically dropped CSS and JS files in a new directory named with the svn rev number, which made it very easy to deploy and break through the client-side cache.
That makes sure the code is consistent if the user refreshes the webpage or just visits it the first time. But what happens if the user just keeps the AJAXy web-page open for hours (as I do with Gmail for instance)? If you deploy too often and both frontend+backend code are in flux, you're more likely to end up with an inconsistent code state.
I guess you could make the frontend code aware of the code version, include it as a param with each XHR request, have the server check versions and return a "version mismatch", and then produce some alert on the browser asking to refresh the page. But this would tradeoff far too much usability.
Last time we ran into this one, we made sure as much page state as possible was pushed into the fragment portion of the URL (for bookmarkability as much as anything else).
Then when the AJAX stuff saw a version mismatch, it would wait until the user completed any operation that -wasn't- stored in the fragment and put up an "updating, gimme a sec" box, and refresh itself.
It was a hell of a lot of work but -extremely- slick (which I'm allowed to say because it wasn't me who wrote that part ;)
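The server side of that version-mismatch detection might look something like this sketch (the header name and helper are assumptions for illustration, not the actual code):

```python
# The client sends the code version it was originally served with on
# every XHR; a stale version gets a distinctive payload telling the
# frontend to refresh itself once the user is idle, as described above.

CURRENT_VERSION = "2009-02-12-3"   # stamped into each deploy

def check_client_version(request_headers):
    """Return None if the client is current, or a 'please refresh'
    payload if it was served by an older deploy."""
    client_version = request_headers.get("X-Code-Version")
    if client_version == CURRENT_VERSION:
        return None
    return {"status": "version_mismatch",
            "action": "refresh_when_idle",
            "current": CURRENT_VERSION}
```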
The major advantage to using "script.js?v={timestamp}" is that it maintains a consistent URI for the resource. Whereas with "script_{timestamp}.js", everything that points to it needs to be updated every time it changes.
You could create a symbolic link or rewrite rule that directs requests for "script.js" to the latest "script_{timestamp}.js" but it's more convenient to use a URI parameter.
The problem with script.js?v={timestamp} is that it's ignored by some browsers while script_{timestamp}.js isn't. And with script.js?v={timestamp} you can't set good cache headers.
Also, if you ever move to a CDN, then you are forced to use real versioning (at least with Amazon Cloudfront).
The versioning scheme we use is `md5 hash of name + file contents + file extension` (and not timestamp).
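That hash scheme can be sketched as follows (an illustrative reconstruction, not the actual build code):

```python
# Content-addressed versioning as described above: the version is an
# md5 over the file name, its contents, and its extension, so the URL
# changes exactly when the content changes and stays stable otherwise.

import hashlib
import os

def hashed_name(path):
    """Return e.g. 'app_<md5hex>.js' for the file at path."""
    base = os.path.basename(path)
    _, ext = os.path.splitext(base)
    with open(path, "rb") as f:
        contents = f.read()
    digest = hashlib.md5(base.encode() + contents + ext.encode()).hexdigest()
    stem = base[: -len(ext)] if ext else base
    return f"{stem}_{digest}{ext}"
```

Unlike a timestamp, an unchanged file keeps the same name across deploys, so users only re-download assets that actually changed.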
I'm not aware of a browser that ignores URI parameters.
Moving to a CDN does not force you to put versioning in the path or filename. The URI parameter merely tricks the browser into thinking there is a new file. The parameter itself is otherwise ignored.
Unless you specify "Cache-control: no-cache" header you aren't really sure how the browser caches your static files (especially if the user is behind a proxy - and even "Cache-control: no-cache" can easily be ignored).
deploy/rollback is probably ok for a consumer site. But not everything is a public website (no really...) - if you're deploying a service with an SLA with dollar penalties for downtime you might want to stick to a more traditional release cycle. I sure hope the phone network, the stock exchange and my bank aren't using deploy/rollback and releasing 50 times a day!
I think the original article misguided some people. It all looked very simple, update the code and put it in production. That _is a horrible idea_, as some have noted.
What's not horrible is having thousands of tests, on dozen of machines, 9 minutes to-live, with selective updating of users, and rollbacks, as this article has explained.
The original post was too light on details, I guess. Its intention was not to be comprehensive anyway, the focus was why recently changed code should be put in production ASAP. But it looked like the author was simply FTPing after commit. And the whole "SOMEONE IS WRONG ON THE INTERNET" thing kicked in.
I'm also one of the developers on a hobby project called http://TIGdb.com (Jeff Lindsay is the other, and has written the majority of the website). We don't have a big Continuous Deployment infrastructure, but we also don't have the users and business requirements of IMVU.
We started with the usual, completely manual deploys and hard-to-setup sandboxes, and have been iterating towards a fully automated setup ever since. The entire time we've been doing this, we've been committing and deploying often. Our users are patient, because we're giving them something they can't get elsewhere and we're giving it to them for free. As we do introduce regressions, we'll post-mortem them (probably using the 5 why's technique) and we'll slowly evolve a system to prevent regressions. If the site is a success, we'll have evolved a world class deploy system. If the site never makes it that big then we won't have wasted time on infrastructure. It's classic lean startup thinking (even though TIGdb is really just a hobby project).
Just curious - who maintains the Selenium tests, and how big is the development / "QA" team?
I've never worked in a team big enough that it could devote resources to maintaining all of the following kinds of tests:
* unit
* functional
* AND acceptance
* plus writing the actual code
IMHO, a neutral third-party group like QA should be responsible for writing & maintaining acceptance tests.
Looks like meeting that goal would constrain you to write code to be used by a robot and not by a human. There may be many cases where this is both doable and acceptable to the end user. So no problem with that.
I am greatly challenged to see how this could be done for a highly interactive, visually oriented application that generates subtle patterns in response to user input. Computers are still not as bright as earthworms when it comes to generalized pattern recognition. Which means we programmers are about as bright as earthworms when it comes to writing such code.
How then could computers automatically test all the software reactions to the wonderful and totally unpredictable behavior of mere humans as they interact with your software? The test cases would expand to consume all the resources available for development. All you would get done is writing all but impossible test cases. At least you wouldn't ship bugs.
This does not even consider the explosion of combinations and permutations of inputs, which prohibits exhaustive testing no matter how many systems you run tests on.
It would be much easier and cheaper to go out of business. Your certainty of being free of shipped bugs would be much better than one in a million.
As I noted elsewhere in this thread (http://news.ycombinator.com/item?id=475391), this article is not merely about automated tests. The author says that his company is using continuous deployment because it lets live, human end users bang on the code, as quickly as possible, in bite-sized chunks that can more easily be rolled back and fixed.
Why not have your local tests, automated or not, cover the common cases and error conditions to catch programmer stupidities? Then let the actual humans do the strange corner cases.
If your design is even close to correct, testing repeatedly tested code is pointless. If your design is corrupt and your implementation is sloppy, no amount of testing is going to save your ass.
I do very rapid turns and I am a one-man team. I can turn my system in less than 30 minutes and have the user testing it in a live situation on the other coast. If I want 10 turns a day, I can easily do it. Low coupling, high cohesion, clean correct design, and disciplined implementation make it possible.
I agree that doing things in small chunks is a great way to do it but doing the equivalent of a weeks worth of global automated testing for each small change seems like a silly exercise. That is except for the server hardware salesmen and system admin people.
The sales commissions and payroll look rather good. The production of real value is questionable. Bang for the buck is as important for testing as it is in any other part of product development.
I think the source of your confusion is that you're a one man team. You don't have to solve problems that 20-man teams have to face. At least half of all the code I depend on is code I do not understand, so I have to depend on its tests, and I have to make the same promise to consumers of my code. If my change breaks code someone else wrote that I didn't foresee, I am depending on his tests to tell me what I screwed up.
Maybe the problem is that you have the 20-man team. There is no coherence in the code. The design is wrong, coupling is too high, and the module cohesion is too low. The large team makes certain that is the case no matter how "tight" (aka heavy) your quality control process.
I have found from working in large teams, there is a core four who get things done. The rest are simply dead weight dedicated to shuffling paper and attending meetings. At best, they do nothing. At worst they create more work than they do.
Use the right four and dump the other sixteen. You will get at least ten times more productivity and ten times higher quality without even breaking a sweat. If you don't have the right four, you are hosed from the start.
This works fine if you are tackling a problem that can be sufficiently addressed by 4 developers. Depending on the size and scale of the problem you are trying to suggest and the time line required for delivery you may need a larger team.
When you begin to take that into account you realize you have to find ways for the larger team to work together and still produce a quality product. Hence the techniques being used by the author and other companies out there trying to address similar problems.
"...you have to find ways for the larger team to work together and still produce a quality product. "
I am not sure it's possible. The communication overhead of so many linkages forces incoherence. The resultant incoherence forces still more additions to process and body count. That adds still more communication overhead. The result is still more incoherence - not less. If something is "finished", it's simply because time, money, resources, and toleration ran out. The end result was simply called "done".
Maybe that is the best we can do, but I am hard pressed to call products produced that way quality products. See Vista et al. for instructive detail.
Whoa, I love the sound of this as far as development process... But what really blew me away is: 3D Chat makes $1M/month? Really? Or did I find the wrong IMVU?
I would love to see that post. The first question most people ask me is "How do I get there?" and I don't have a great place to point and say "start here"
A well written concise introduction to continuous integration / constant testing would be a boon to this community.
I don't recall the last time an article linked from Hacker News so quickly and dramatically expanded my notion of what is possible in software development. Bravo!
Continuous deployment is good, but the comments are valid.
There is a certain non-zero probability of errors occurring during deployment. Binaries have to be reloaded, database connections re-established, sessions restored, etc., so the more often you deploy, the larger this term looms in the "will something go wrong" equation.
So, what we do is break up our system into deployment groups where some handful of users gets updated a few times an hour sometimes. We test the deployment on this small set of users, usually they know the change is coming and are ready to test the change in real time.
Sometimes we repeat this process using different deployment groups. Test in this one, then test in that one, until we get a final, small, errorless deployment.
If it is successful, we roll it out to the masses.
Your site doesn't have to be /all/ beta or /all/ production. You can have batches of users in different groups.
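One common way to implement that kind of user bucketing, sketched with made-up group counts and names (not the poster's actual system):

```python
# Hash each user id into a stable deployment group, so a given user
# always lands in the same group across deploys, and low-numbered
# groups see new code first.

import hashlib

NUM_GROUPS = 10   # illustrative; real systems pick this per product

def deployment_group(user_id):
    """Stable group in [0, NUM_GROUPS); group 0 gets updates first."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_GROUPS

def sees_new_code(user_id, rolled_out_groups):
    """True once the rollout has reached this user's group."""
    return deployment_group(user_id) < rolled_out_groups
```

Hashing (rather than random assignment) matters: it keeps each user's experience consistent between page loads while the rollout is partial.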
Any chance of more detail than you are giving in your posting? This is extremely interesting stuff, I'd really like to know a lot more about what goes into achieving this.
No, seriously, I'd be much obliged if you could tell what tools go into your setup, how much of it is created in house - and thus unavailable - and how much of it is off the shelf, preferably open source. I'd very much like to spend time on recreating what you've done there.
I've written in light detail about this in a few places; I'd be glad to share more. Here's an assortment off the top of my head. Feel free to ask anything else you'd like to know.
Whether you commit your code once in one batch at the end of the day or 50 times in 50 smaller chunks, you have the same amount of complexity about which to be careful. In fact it's more complex in the former case, because in the latter, for each push, you know that all the previous pushes are working.
When something breaks in production, it's easier to figure out what it was and fix it if you only changed a few things since the last time you updated production.