
I can appreciate the effort put into the goal of optimization shared in the post, even if I disagree with the conclusions. All of that effort would be much better directed at doing a manual (or LLM-assisted) audit of the E2E tests and choosing what to prune to reduce CI runtime.

DHH recently described[0] the approach they've taken at Basecamp: reducing ~180 comprehensive-yet-brittle system tests down to 10 good-enough smoke tests. It feels much more in the spirit of where I would recommend folks invest effort, because teams have way more tests than they need for an adequate level of confidence. Code and tests are a liability, and, to paraphrase Kent Beck[1], we should strive to write the minimal amount of tests and code to gain the maximal amount of confidence.

The other wrinkle here is that we're often paying through the nose (in complexity and in actual dollars spent on CI services) by choosing to run all the tests all the time. Figuring out how not to do that is a noble and worthy goal, _but_ I think the conclusion shouldn't be to throw more $$$ into that money pit. Rather, we should use all the power we have in our local dev workstations, plus trust, to verify that something is in a shippable state. That's another idea DHH covers[2] in the Rails World 2025 keynote; the whole thing is worth watching IMO.

[0] - https://youtu.be/gcwzWzC7gUA?si=buSEYBvxcxNkY6I6&t=1752

[1] - https://stackoverflow.com/questions/153234/how-deep-are-your...

[2] - https://youtu.be/gcwzWzC7gUA?si=9zL-xWG4FUxYZMC5&t=1977



Agreed. When you have multiple developers working on the same code, you end up with overlapping test coverage as time goes on. You also end up with test coverage that was written with good intentions, but later you'll find that some of it just isn't necessary for confidence, or isn't even testing what you think it is.

Teams need to periodically audit their tests, figure out what covers what, figure out what coverage is actually useful, and prune stuff that is duplicative and/or not useful.
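As a rough sketch of the "figure out what covers what" step: assuming you can export per-test coverage as a mapping from test name to the set of lines it exercises, you could flag tests whose coverage is entirely contained in another test's. The function name, test names, and coverage data below are all made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def find_redundant_tests(coverage: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (candidate, covering) pairs where every line exercised by
    `candidate` is also exercised by `covering`."""
    pairs = []
    names = sorted(coverage)
    for a in names:
        for b in names:
            if a == b:
                continue
            if coverage[a] <= coverage[b]:
                # When two tests cover exactly the same lines, report the
                # pair only once (the later name as the candidate).
                if coverage[a] == coverage[b] and a < b:
                    continue
                pairs.append((a, b))
    return pairs


# Hypothetical per-test coverage data, keyed by "file:line".
coverage = {
    "test_login_smoke": {"auth.py:10", "auth.py:11", "session.py:5"},
    "test_login_full": {"auth.py:10", "auth.py:11", "auth.py:20", "session.py:5"},
    "test_checkout": {"cart.py:3", "payment.py:8"},
}

for candidate, covering in find_redundant_tests(coverage):
    print(f"{candidate} is fully covered by {covering}")
```

Line coverage only tells you what code ran, not what was asserted, so a pair flagged here is a candidate for a human to look at, not something to delete automatically.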

OP says that ultimately their costs went down: even though using Claude to make these determinations is not cheap, they're saving more than they're paying Claude by running fewer tests (they run tests on a mobile device test farm, and I expect that can get pricey). But ultimately they might be able to save even more money by ditching Claude and deleting tests, or modifying tests to reduce their scope and runtime.

And given the current sophistication level of LLMs, I would feel safer not having an LLM decide which tests actually need to run to ensure a PR is safe to merge. I know OP says that so far they believe it's doing the right thing, but a) they describe their methodology for verifying this in a comment here[0], and I don't agree that it's a sound methodology[1], and b) LLM output is not deterministic or repeatable, so it could choose two very different sets of tests if run twice against the exact same PR. The risk of that happening may be acceptable, though; that's for each individual to decide.
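One way to put a number on point b), purely as a sketch: run the selection twice on the same PR and compare the two sets of chosen tests with Jaccard similarity. The function name and test names here are hypothetical, and this is not something OP describes doing.

```python
from typing import Iterable


def selection_overlap(run_a: Iterable[str], run_b: Iterable[str]) -> float:
    """Jaccard similarity of two test selections:
    1.0 means identical sets, 0.0 means nothing in common."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        # Two empty selections are trivially identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical selections from two runs against the same PR.
first_run = ["test_login", "test_checkout", "test_profile"]
second_run = ["test_checkout", "test_profile", "test_search"]

print(selection_overlap(first_run, second_run))
```

A score that stays well below 1.0 across repeated runs would suggest the selection is too unstable to gate merges on.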

[0] https://news.ycombinator.com/item?id=45152504

[1] https://news.ycombinator.com/item?id=45152668



