Software engineering is not a new field. Best practices on testing are mature now, and Anthropic has poached enough engineers from companies with a solid understanding of those practices.
Yet, their flagship product got three really bad changes shipped into it and only resolved after more than a month.
This raises another question: with all the industry-wide boasting about AI-driven productivity, why does the leading company in agentic coding take over a month to fix severe customer-reported issues?
> Why does it take the company that is probably the best at agentic coding more than a month to find and solve such large regressions, even with customers complaining about them?
My unfounded suspicion: because this is the tradeoff we're all facing and for the most part refusing to accept when transitioning over to LLM-driven coding. This is exactly how we're being trained to work by the strengths and limitations of this new technology.
We used to depend on maintaining a global if incomplete understanding of a whole system. That enabled us to know at a glance whether specs and tests and actual behavior made sense and guided our thinking, enabling us to know what to look at. With agentic coding, the brutal truth is that this is now a much less "efficient" approach and we'll ship more features per day by letting that go and relying on external signs of behavior like test suites and an agent's analysis with respect to a spec. It enables accomplishing lots of things we wouldn't have done before, often simply because it would be too much friction to integrate it properly -- write tests, check performance, adjust the conceptual understanding to minimize added complexity, whatever.
So in order to be effective with these new tools, we're naturally trained to let go of many of the things we formerly depended on to keep quality up. Mistakes that would have formerly been evidence of stupidity or laziness are now the price to pay for accelerated productivity, and they're traded off against the "mistakes" that we formerly made that were less visible, often because they were in the form of opportunity cost.
Simple example: say you're writing a simple CLI in Python. Formerly, you might take in a fixed sequence of positional arguments, or even if you did use argparse, you might not bother writing help strings for each one. Now because it's no harder, the command-line processing will be complete and flexible and the full `--help` message will cover everything. Instead, you might have a `--cache-dir=DIR` option that doesn't actually do anything because you didn't write a test for it and there's no visible behavioral change other than worse performance.
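To make that concrete, here is a minimal sketch of the kind of test that would catch a dead `--cache-dir`. The `fetch` program name, the `key` argument, and `expensive_lookup` are all made up for illustration; the only point is that the test asserts on an observable side effect of the option rather than on the parser accepting it.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: one positional argument plus the --cache-dir option.
    parser = argparse.ArgumentParser(prog="fetch", description="Example CLI.")
    parser.add_argument("key", help="item to look up")
    parser.add_argument("--cache-dir", metavar="DIR", type=Path, default=None,
                        help="directory for cached results")
    return parser


def expensive_lookup(key: str) -> str:
    # Stand-in for slow work that the cache is supposed to avoid.
    return key.upper()


def run(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    if args.cache_dir is None:
        return expensive_lookup(args.key)
    cached = args.cache_dir / f"{args.key}.txt"
    if cached.exists():
        return cached.read_text()
    result = expensive_lookup(args.key)
    args.cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(result)
    return result


def cache_dir_has_effect() -> bool:
    """Return True if passing --cache-dir leaves an observable trace on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        run(["hello", "--cache-dir", tmp])
        return any(Path(tmp).iterdir())
```

If `run` silently ignored `args.cache_dir`, the output would still be correct and `--help` would still look complete, but `cache_dir_has_effect()` would return `False`.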
Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy. We're being trained away from that. There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.
> you might have a `--cache-dir=DIR` option that doesn't actually do anything
Working in enterprise software it's surprising how long an option that doesn't actually do anything can be missed. And that was before AI and having thousands of customers use it.
This same problem happens with documentation all the time. You end up with paragraphs or examples that simply don't reflect what the product actually does.
Where I work, options that don't do anything are seen as good engineering practice. You see, you can't break your user's scripts. Your CLI arguments are part of your stable API. If your tool used to have a cache_dir CLI option, and now no longer needs it, you still have to keep accepting cache_dir and treat it as a no-op until you are confident your users have migrated away from it.
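A rough sketch of what a deliberate no-op looks like with argparse, as opposed to an accidental one. The `tool` program name, the `key` argument, and the `DeprecatedNoOp` class are invented for the example; the spelling `--cache-dir` is an assumption too. The option is still accepted so old scripts keep running, it is hidden from `--help`, and it emits a warning instead of doing anything.

```python
import argparse
import warnings


class DeprecatedNoOp(argparse.Action):
    """Accept an option and its value, warn, and otherwise ignore it."""

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is deprecated and has no effect",
            DeprecationWarning,
            stacklevel=2,
        )


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="tool")
    # Still parsed so existing invocations do not break; hidden from --help.
    parser.add_argument("--cache-dir", metavar="DIR", action=DeprecatedNoOp,
                        help=argparse.SUPPRESS)
    parser.add_argument("key", help="item to look up")
    return parser
```

The difference from the bug described upthread is that this no-op is intentional and announces itself, so it has a chance of being removed once users have migrated.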
I've been working on this problem coming from the program synthesis school of thought over at https://promptless.ai (which you would have no clue about just from looking at the website, because it's targeted at tech writers).
I'm quite fond of the idea of incremental mutation of agent trajectories to move/embody some of the reasoning steps from LLM tokens into a program. Imagine you have a long agent transcript/trajectory and you have a magic wand to replace a run of messages with "and now I'll call this script which gives me exactly the information I need," then seeing if the rewritten trajectory is stable.
To give credit where it's due, it's an overly complicated restatement of what Manny Silva has been saying with docs-as-tests https://www.docsastests.com/. Once you describe some user flow to humans (your "docs"), you can "compile" or translate part or all of those steps into deterministic test programs that perform and validate state transitions. Ideally you compile an agent trajectory all the way.
So: working with coding agents, you've cranked up the defect rate in exchange for speed, so let's try testing all important flows. The first thing you try is: ok, I've got these user guides, I guess I'll have the agent follow along and try to do it. And that works! But it's a little expensive and slow.
So I go, ok I'll have the agent do it once, and if it finds a trajectory through a product that works, we can reflect on that transcript and make some helper scripts to automate some or all of those state transitions, then store these next to our docs.
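For a feel of what "compiling" a transcript could look like, here is a toy sketch. It is not how Promptless or Doc Detective work; `Step`, `replay`, and the fake in-memory "product" are all hypothetical. Each step keeps the doc sentence it came from, a deterministic action, and a check on the resulting state.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # toy stand-in for real product state


@dataclass(frozen=True)
class Step:
    doc_line: str                    # the sentence in the user guide
    action: Callable[[State], None]  # deterministic replacement for the agent's actions
    check: Callable[[State], bool]   # validates the expected state transition


def replay(steps: list[Step], state: State) -> list[str]:
    """Run every step in order; return the doc lines whose checks failed."""
    failures = []
    for step in steps:
        step.action(state)
        if not step.check(state):
            failures.append(step.doc_line)
    return failures


# Hypothetical flow extracted from one successful agent run.
def create_project(state: State) -> None:
    state.setdefault("projects", []).append("demo")


def invite_member(state: State) -> None:
    state.setdefault("members", []).append("alice")


STEPS = [
    Step("Create a project named demo", create_project,
         lambda s: "demo" in s.get("projects", [])),
    Step("Invite alice to the project", invite_member,
         lambda s: "alice" in s.get("members", [])),
]
```

`replay(STEPS, {})` returning an empty list means the documented flow still works; a non-empty list tells you which sentences in the guide no longer match the product.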
And then you say, ok if I ship a product change, can I have my coding agent update those testing scripts to save the expense and time of re-running the original follow-along. Also an obvious thing to do, and you can totally build it yourself with Claude Code in a github action. But I think there is a lot of complexity in how you go about doing this, what kind of incremental computation you can do to keep the LLM costs of all this under a couple hundred bucks a month for teams shipping 20 changes a day with 200 pages of docs.
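One very simple form of incremental computation, and only a guess at the general shape rather than a description of any real product: fingerprint each doc page together with the source files its flow touches, and only re-run the pages whose fingerprint changed. `fingerprint` and `stale_pages` are hypothetical names, and how you decide which files a page depends on is the hard part this sketch skips.

```python
import hashlib


def fingerprint(doc_text: str, dependency_texts: list[str]) -> str:
    """Hash a doc page together with the contents of the files it depends on."""
    h = hashlib.sha256()
    for part in [doc_text, *dependency_texts]:
        data = part.encode("utf-8")
        # Length prefix so ("a", "b") and ("ab", "") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def stale_pages(current: dict[str, str], stored: dict[str, str]) -> list[str]:
    """Return pages whose fingerprint is new or differs from the stored one."""
    return sorted(page for page, fp in current.items() if stored.get(page) != fp)
```

Only the pages returned by `stale_pages` would need the expensive agent follow-along; everything else can reuse its stored scripts.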
The most polished open source "compiler/translator" I've seen exploring these ideas so far is Doc Detective (https://doc-detective.com) by Manny.
I am not sure this approach can take you very far.
In my experience, CC makes it very very easy to _add_ things, resulting in much more code / features.
CC can obviously read/understand a codebase much faster than we do, but this also has a limit (how much context we can feed into it) - I think your approach is in essence a bet that future models' ability to read/understand code (size of context) improves as fast or faster than the current models' ability to create new code.
Ouch. I guess this came across as "my approach". I haven't done enough agentic coding to feel like I know enough to have a worthwhile opinion, but at the moment I'm squarely in your camp. I don't believe it's going to work to let an agent loose expanding a teetering codebase with little to no concern for maintainability. We're going to have to painfully relearn the lessons of pre-AI coding, whatever that means with AI in the mix.
> Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy.
I don't even use Claude, and it has been rather clear to me that their service has not been working properly for some time now.
> digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent
not to sound uncharitable but this seems like the absolute worst way to run a business; your customers are basically lab rats... why should they pay for anything in this scenario?
I just said someone's gonna build it, not that it's a good idea!
To be fair [to myself], this is scale-dependent. I work on a product with hundreds of millions of users. We're not going to be reading and pondering every bit of feedback we get. We have automation for stripping out some of the noise (eg the number of crash reports we get from bit flips due to faulty RAM is quite significant at this scale). We have lines of defense set up to screen things down -- though if you file a well-researched and documented bug, we'll pay attention. (We won't necessarily do what you want, but we'll pay attention.)
When I worked at a much smaller and earlier stage company, we begged our users for feedback. We begged potential users for feedback. We implemented some things purely to try to get someone excited enough that they would be motivated to give feedback.
Anthropic, OpenAI, Google? They have a lot of users.
Also, this automation would be in addition to the other channels by which you'd pay attention to feedback.
Also also, the ship has sailed. We're all lab rats now. We're randomly chosen to be A/B tested on. We are upgraded early as part of a staged rollout. We're region-locked. Geocoded. Tracked as part of the cohort that has bought formula or diapers recently. Maybe we live in the worst of all possible worlds?
> There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.
My theory is that most problem solvers are bad at solving problems, and most managers are bad at managing, and it doesn't matter how evolution created them: They'll make mistakes, they'll have finite time and energy, a finite context window, they'll lie and internally rewrite their own internal narratives as needed, and forget things, and drop balls, and they'll go in circles trying to find a bug they created but are too close to be able to see, and they're going to need a lot of external tooling to get through the day without forgetting anything, and constant reminders from others to get shit done. And this dynamic fundamentally creates peaks and valleys in productivity.
Wait, were we talking about humans or AI?
...
Everyone seems to be assuming either the humans or the AI has to be special. What if neither are?
models are great but models don't magically fix things. you need to set up systems to handle the output of code, you need to instrument metrics for the llm to listen to and flag. experimentation is a huge problem: with the huge output of code, how do you keep your business metrics clean and isolate issues? these are all hard challenges.
in response, most companies are explicitly trading velocity for quality, and finding out that quality is actually important at the end of the day. if you look at the roadmap it's just ship ship ship. eng is being told to 3x their output. quality in the llm coded world is tough and there's not much appetite for it right now.