
Hey, Boris from the Claude Code team here. A few tips:

1. If there is anything Claude tends to repeatedly get wrong, not understand, or spend lots of tokens on, put it in your CLAUDE.md. Claude automatically reads this file and it’s a great way to avoid repeating yourself. I add to my team’s CLAUDE.md multiple times a week.

2. Use Plan mode (press shift-tab 2x). Go back and forth with Claude until you like the plan before you let Claude execute. This easily 2-3x’s results for harder tasks.

3. Give the model a way to check its work. For Svelte, consider using the Puppeteer MCP server (setup sketch below) and tell Claude to check its work in the browser. This is another 2-3x.

4. Use Opus 4.5. It’s a step change from Sonnet 4.5 and earlier models.
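(For #3, adding the Puppeteer server is usually a one-liner. Something like this, though check the MCP docs since the exact package and flags may have changed:)

  claude mcp add puppeteer -- npx -y @modelcontextprotocol/server-puppeteer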

Hope that helps!





> If there is anything Claude tends to repeatedly get wrong, not understand, or spend lots of tokens on, put it in your CLAUDE.md. Claude automatically reads this file and it’s a great way to avoid repeating yourself.

Sure, for 4-5 interactions, then it will ignore them completely :)

Try it for yourself: add an instruction to CLAUDE.md to always refer to you as Mr. bcherny and it will stop doing so very soon. Coincidentally, at that point it also loses track of all the other instructions.


One of the things you get an intuition for after using these systems is when to start a new conversation, and the basic rule of thumb is “always.” Use a conversation for one and only one task or question, and then start a new one. For longer projects, have the LLM write down a plan or checklist, and then have it tackle each step in a new conversation. LLM context collapse happens well before you hit the token limits, and things like ground rules and whatnot stop influencing the LLM's outputs after a couple tens of thousands of tokens, in my experience.

(Similar guidance goes for writing tools & whatnot - give the LLM exactly and only what it needs back from a tool, don’t try to make it act like a deterministic program. Whether or not they’re capital-I intelligent, they’re pretty fucking stupid.)


Yeah, adherence is a hard problem. It should be feeling much better in newer models, especially Opus 4.5. I generally find that Opus listens to me the first time.

Have been using Opus 4.5 and can confirm this is how it feels, it just works.

It also works your wallet

Right now Google Antigravity has free Claude Opus 4.5, with pretty decent allowances.

I also use GitHub Copilot, which is just $10/mo. I have to use the official Copilot though; if I try to 'hack it' to work in Claude Code it burns through all the credits too fast.

I am having a LOT of great luck using Minimax M2 in Claude Code. It's very cheap, and it works so well it's close to Sonnet in Claude Code. I use a tool called cc-switch to swap different models in and out of Claude Code.


Highly recommend Claude Max, but I also want to point out Opus 4.5 is the cheapest Opus has ever been.

(I just learned ChatGPT 5.2 Pro is $168/1M tokens. Insanity.)


If you pay for a Claude Max subscription it is the same price as previous models.

Just wait a few months -- AI has been getting more affordable _very_ quickly

I’ve felt that the LLM forgets CLAUDE.md after 4-5 messages. Then, why not reinject CLAUDE.md into the context at the fifth message?
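(A UserPromptSubmit hook would be one way to do that yourself, if I understand the hooks feature right: its stdout gets added to the context on every prompt. A sketch for settings.json, schema from memory, so double-check the hooks docs:)

  {
    "hooks": {
      "UserPromptSubmit": [
        { "hooks": [ { "type": "command", "command": "cat CLAUDE.md" } ] }
      ]
    }
  }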

CLAUDE.md should be picked up and injected into every message you send to the model, regardless of whether it is the 1st or 10th message in the same session.

Yes. One of my system-wide instructions is “Read the Claude.md file and any readme in the current directory, then tell me how you slept.”

If Claude makes a yawn or similar, I know it’s parsed the files. It’s not been doing so the last week or so, except for once out of five times last night.


The number of times I’ve written “read your own fucking Claude.md file” is a bit too numerous.

“You’re absolutely right! I see here you don’t want me to break every coding convention you have specified for me!”


The attention algorithm does that; it has a recency bias. Your observation is not necessarily indicative of Claude not loading CLAUDE.md.

I think you may be observing context rot? How many back and forths are you into when you notice this?


That explains why it happens, but doesn't really help with the problem. The expectation I have as a pretty naive user is that what is in the .md file should be permanently in the context. It's good to understand why this is not the case, but it's unintuitive and can lead to frustration. It's bad UX, if you ask me.

I'm sure there are workarounds such as resetting the context, but the point is that good UX would mean such tricks are not needed.


It’s not that it’s not in the context, it’s that it was injected so far back that it is deemed not so important when determining the next token.

Yeah, the current best approach is to aggressively compact and recreate context by starting fresh. It's awkward and I wish I didn't have to.

I'm surprised this hasn't been automated yet, but I'm pretty naive to the space - the problem of "when?"/"how often?" seems like a fun one to chew on.

I think Gemini 3 Pro (high) in Antigravity does something like that, because I can keep asking for different changes in the same chat without needing to create a new session.

I know the reason; I just took the opportunity of replying to a Claude dev to point out why it's no panacea and how it requires consistent context management.

The real, semi-productive workflow is "write plans in markdowns -> new chat -> implement a few things -> update plans -> new chat", etc.


This is cool, thank you!

Some things I found from my own interactions across multiple models (in addition to above):

- It's basically all about the importance of (3). You need a feedback loop (we all do), and the best way is for it to change things and see the effects (ideally also against a good baseline, like a test suite, where it can roughly gauge how close or far it is from the goal). For assembly, a debugger/tracer works great (using batch mode or scripts, as models/tooling often choke on that kind of interactive TUI I/O).

- If it keeps missing the mark, tell it to decorate the code with a file log recording all the info it needs to understand what's happening (sketch after this list). Its analysis of such logs normally zeroes in on the solution pretty quickly, especially for complex tasks.

- If it's really struggling, tell it to sketch out a full plan in pseudocode, explain why that will work, and analyze it for any gotchas. Then have it analyze the differences between the current implementation and the ideal it just worked out. This often helps get it unblocked.
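A minimal sketch of the kind of file-log decoration I mean, in Python (all names made up):

  import logging

  # Dedicated debug log that the model can read back after a run
  logging.basicConfig(filename="debug_trace.log", level=logging.DEBUG,
                      format="%(asctime)s %(funcName)s %(message)s")

  def parse_record(raw: str) -> dict:
      logging.debug("input=%r", raw)        # capture the input
      fields = raw.strip().split(",")
      logging.debug("fields=%r", fields)    # capture intermediate state
      result = {"id": fields[0], "value": float(fields[1])}
      logging.debug("result=%r", result)    # capture the output
      return result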


Hey Boris,

I couldn't agree more. And using Plan mode was a major breakthrough for me. Speaking of Plan Mode...

I was previously using it repeatedly in sessions (and was getting great results). The most recent major release introduced this bug where it keeps referring back to the first plan you made in a session even when you're planning something else (https://github.com/anthropics/claude-code/issues/12505).

I find this bug incredibly confusing. Am I using Plan Mode in a really strange way? Because for me this is a showstopper bug - my core workflow is broken. I assume I'm using Claude Code abnormally, otherwise this bug would be a bigger issue.


Yes, as lostdog says, it's a new feature that writes plans made in plan mode to ~/.claude/plans. And it thinks it needs to continue the same plan that it started.

So you either need to be very explicit about starting a NEW plan if you want to do more than one plan in a session, or close and start a new session between plans.

Hopefully this new feature will get less buggy. Previously the plan was only in context and not written to disk.


Why don’t you reset context when working on something else?

Because it's additional features that are related to what was already built.

For example, making a computer-use agent… I made the plan, the implementation was good; now I want to add a new tool for the agent, but I want to discuss the best way to implement this tool first.

Clearing context means Claude forgets everything about what was just built.

Asking to discuss this new tool in plan mode makes Claude rewrite the entire spec for some reason.

As a workaround, I tell Claude “looks good, delete the plan” before doing anything. I liked the old way, where once you exited plan mode the plan was done, and the next plan mode was a new plan with the existing context.


I get where you're coming from. But you'll likely get better results by starting fresh and letting it read key files, or just a summary of the project goals/spec, and then implementing the next feature building on the previous one. It's unlikely you'll need all the underlying code of the foundation in context to implement something that builds on it - especially if interfaces are clean. Models still get dumber the more context is loaded, and the usable window isn't all that big, so starting fresh usually gives the best results. I try to avoid compaction in any way possible, and I rarely continue the session after compaction, for that reason.

Yes, I've also been confused by things like this. Claude Code sometimes saves plans to ~/.claude/plans under animal names. But it's not really surfaced where the plan goes, nor what the expected way to refer back to them is.

Thank you for Claude Code (Web). Google has a similar offering with Google Jules. I got really, really bad results from Jules and was amazed by Claude Code when I finally discovered it.

I compared both with the same set of prompts, and Claude Code seemed like a senior expert developer, while Jules... well, I don't know how it could be that bad ;-)

Anyway, I also wanted persistent information, so I don't have to feed Claude Code the same stuff over and over again. I was looking for functionality similar to Claude Projects, but that's not available for Claude Code Web.

So I asked Claude what would be a way of achieving pretty much the same thing as Projects, and it told me to put all the information I wanted to share in a file named .clinerules. Claude told me I should put that file in the root of my repository.

So please help me: is your recommendation the correct way of doing this, or did Claude give the correct answer?

Maybe you can clear that up by explaining the difference between the two files?


CLAUDE.md is the correct file for Claude.

Do you recommend having Claude dump your final plan into a document and having it execute from that piece by piece?

I feel like when I do plan mode (for CC and competing products), it seems good, but when I tell it to execute, the output is not what we planned. I feel like I get slightly better results executing from a document in chunks (which of course necessitates building the iterative chunks into the plan).


Since we released the last major version of Claude Code, Claude writes its plan to a file automatically for that reason! It also means you can continue to edit your plan as you go.

Opus 4.5 seems to be able to plan without being asked, but I have used this pattern of "write a plan to an .md", review and maybe edit, and then execute, maybe in another thread... I have used it with Codex and it works well.

Proliferating .md files need some attention, though.


a very common pattern is planner / executor.

yes the executor only needs the next piece of the plan.

I tend to plan in an entirely different environment, which fits my workflow and has the added benefit of providing a clear boundary between the roles. I aim to spend far more time planning than executing. if I notice I'm getting more caught up in execution than expected, that's a signal to revise the plan.


I ask it to write a plan and, when it starts the work, to keep progress in another document and never change the plan. If I didn't do this, somehow with each code change the plan document would grow and change. Keeping plan and progress separate prevented this from happening.

I ask Claude to dump the plan into a file and ensure the tasks have been split into subtasks such that each subtask's description is detailed enough that the probability of the LLM misinterpreting it is very low.

I often use multiple documents to plan things that are too large to fit into a single planning mode session. It works great.

You can also use it in conjunction with planning mode: use the documents to pin everything down at a high-to-medium level, then break off chunks and pass those into planning mode for fine-grained, code-level planning and a final check before implementation.


  > I add to my team’s CLAUDE.md multiple times a week.
How big is that file now? How big is too big?

Something to keep in mind: if your CLAUDE.md file is getting large, consider alternative approaches, especially for repeatable tasks. Using slash commands and skills for repeatable workflows is a really nice way to keep your rules file from exploding. I have slash commands for code review and git commit management, and skills for complex tool interactions. Our company has its own deployment CLI tool, so using skills to make Claude Code an expert at using this tool has done wonders for Claude Code's performance when working on CI/CD problems.

I am currently working on a new slash command, /investigate <service>, that runs triage for an active or past incident. I've had Claude write tools to interact with all of our partner services (AWS, JIRA, CI/CD pipelines, GitLab, Datadog), and now when an incident occurs it can quickly put together an early analysis of the incident, finding the right people to involve (not just owners but the people who last touched the service) and potential root causes, including service dependency investigations.

I am putting this through its paces now, but early results are VERY good!
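For anyone who hasn't set one up: a custom slash command is just a markdown file under .claude/commands/. A simplified sketch of what an /investigate command could look like (this would be .claude/commands/investigate.md; the contents are illustrative, not my actual file):

  Triage the $ARGUMENTS service. Pull recent deploys and alerts,
  list the people who last touched the service, and summarize
  likely root causes with links to the evidence.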


Try to keep it under 1k tokens or so. We will show you a warning if it might be too big.

Ours is maybe half that size. We remove from it with every model release since smarter models need less hand-holding.

You can also break up your CLAUDE.md into smaller files, link CLAUDE.mds, or lazy load them only when Claude works in nested dirs.

https://code.claude.com/docs/en/memory
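For example, a root CLAUDE.md can stay tiny and import the rest (paths made up):

  # CLAUDE.md at the repo root
  @docs/code-style.md
  @docs/testing.md

and a CLAUDE.md nested in e.g. packages/api/ only gets picked up when Claude is working in that directory.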


I've been fine-tuning mine pretty often. Do you have any CLAUDE.md files you can share as good examples? Especially for Opus 4.5.

And thank you for your work!! I focus all of my energy on helping families stay safe online, I make educational content and educational products (including software). Claude Code has helped me amplify my efforts and I’m able to help many more families and children as a result. The downstream effects of your work on Claude Code are awesome! I’ve been in IT since 1995 and your tools are the most powerful tools I’ve ever used, by far.


1k tokens: Google says that's about 750 words. That's actually pretty short. Any chance you could post a few sample instructions, or even link to a publicly available CLAUDE.md file you recommend?

Mine is 24 lines long. It has a handful of stuff, but does refer to other MD files for more specifics when needed (like an early version of skills.)

This is the meat of it:

  ## Code Style (See JULIA_STYLE.md for details)
  - Always use explicit `return` statements
  - Use Float32 for all numeric computations
  - Annotate function return types with `::`
  - All `using` statements go in Main.jl only
  - Use `error()` not empty returns on failure
  - Functions >20 lines need docstrings

  ## Do's and Don'ts
  - Check for existing implementations first
  - Prefer editing existing files
  - Don't add comments unless requested
  - Don't add imports outside Main.jl
  - Don't create documentation unless requested
Since Opus 4.0 this has been enough to get it to write code that generally follows our style, even in Julia, which is a fairly niche language.

That is seriously short. I've asked Claude Code to add instructions to CLAUDE.md and my one line request has resulted in tens of lines added to the file.

yes, if you tell an llm to do things it will be too verbose. either explicitly instruct the length ("add 5 bullet points, tldr format") or just write it yourself.

Seems reasonable to give Claude instructions to be extra terse.

How do you know what to remove?

also after you have a to-and-fro to course correct it on a task, run this self-reflection prompt

https://gist.github.com/a-c-m/f4cead5ca125d2eaad073dfd71efbc...

That will move stuff that required manual clarification back into the claude.md (or a useful subset you pick). It does a much better job of authoring claude.md than I do.


Hah, that's funny. Claude can't help but mess up all the comments in the code even if I explicitly tell it five times not to change any comments. That's literally the experience I had before opening this thread, never mind how often it completely ignores CLAUDE.md.

Hey there, Boris from the Claude Code team! Thanks for these tips! Love Claude Code, absolutely one of the best pieces of software that has ever existed. What I would absolutely love is if the Claude documentation had examples of these, because I see time and time again people saying what to do (in this case, you tell us to update the CLAUDE.md with things it gets wrong repeatedly), but it's very rare to see examples. Just three or four examples of something that got done wrong, and then how you fixed it, would be immensely helpful.

Hi Boris,

If you wouldn't mind answering a question for me: it's one of the main things that has kept me from adding Claude in VS Code.

I have a custom 'code style' system prompt that I want Claude to use, and I have been able to add it when using Claude in the browser:

  Beautiful is better than ugly.
  Explicit is better than implicit.
  Simple is better than complex.
  Complex is better than complicated.
  Readability counts.
  Special cases aren't special enough to break the rules.
  Although practicality beats purity.
  If the implementation is hard to explain, it's a bad idea.
  If the implementation is easy to explain, it may be a good idea.

  Trust the context you're given. Don't defend against problems
  the human didn't ask you to solve.

How can I add it as a system prompt (or whatever it's called) in VS Code so LLMs adhere to it?


Add it to your CLAUDE.md. Claude will automatically read that file every time it starts up.

Thanks for your great work on Claude Code!

One other feature of CLAUDE.md I've found useful is imports: prepending @ to a file name will force it to be imported into context. Otherwise, whether a file is read and loaded into context depends on tool use and planning by the agent (even with explicit instructions like "read file.txt"). Of course this means you have to be judicious with imports; see the example below.
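For example, a single line like this in CLAUDE.md (file name hypothetical):

  @docs/api-conventions.md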


+1 on that, Opus 4.5 is a game changer. I have used it to refactor and modernize one of my old React projects using Bootstrap. You have to be really precise when prompting, and having a solid CLAUDE.md works really well.

In other words, permanent instructions and context well presented in *.md, planning and review before execution, agentic loops with feedback, and a good model.

You can do this with any agentic harness, just plain prompting and "LLM management skills". I don't have Claude Code at work, but all this applies to Codex and GH Copilot agents as well.

And agreed, Opus 4.5 is next level.


I would LOVE to use Opus 4.5, but it means I (a mere Pro peon) can work for maybe 30 minutes a day, instead of 60-90.

I’m old enough to remember being able to work at programming related tasks without any such tools. Is that not still a thing?

I obviously meant "work with it" not work in general.

And as for old, I'm 47. I've been programming since I got my first C64 in 1985.


If a tool craps out after 30 minutes every day, and someone knows they can't rely on it to work when they need it, they tend to change their workflow to avoid the tool entirely.

Context switching between AI-assisted coding and "oops, my tool is refusing to function, guess I'll stop using it" is often worse for productivity than never using the AI to begin with.


And if I may, this advice also applies if you choose Cursor as a coding environment.

Claude Code basically does not use CLAUDE.md, but I wish it did.

I’ve yet to see any real work get done with agents. Can you share examples or videos of real production level work getting done? Maybe in a tutorial format?

My current understanding is that it’s for demos and toy projects


Good question. Why hasn't there been a profusion of new game-changing software, fixes to long-standing issues in open-source software, or any nontrivial shipped product at all? Heck, why isn't there a cornucopia of new apps, even trivial ones? Where is all the shovelware [0]? Previous HN discussion here [1].

Don't get me wrong, AI is at least as game-changing for programming as StackOverflow and Google were back in the day. I use it every day, and it's saved me hours of work for certain specific tasks [2]. But it's simply not a massive 10x force multiplier that some might lead you to believe.

I'll start believing when maintainers of complex, actively developed, and widely used open-source projects (e.g. ffmpeg, curl, openssh, sqlite) start raving about a massive uptick in positive contributions, pointing to a concrete influx of high-quality AI-assisted commits.

[0] https://mikelovesrobots.substack.com/p/wheres-the-shovelware...

[1] https://news.ycombinator.com/item?id=45120517

[2] https://news.ycombinator.com/item?id=45511128


"Heck, why isn't there a cornucopia of new apps, even trivial ones?"

There is. We had to basically create a new category for them on /r/golang because there was a quite distinct step change near the beginning of this year where suddenly over half the posts to the subreddit were "I asked my AI to put something together, here's a repo with 4 commits, 3000 lines of code, and an AI-generated README.md. It compiles and I may have even used it once or twice." It toned down a bit but it's still half-a-dozen posts a day like that on average.

Some of them are at least useful in principle. Some of them are the same sorts of things you'd see twice a month, only now we can see them twice a week if not twice a day. The problem wasn't necessarily the utility or the lack thereof, it was simply the flood of them. It completely disturbed the balance of the subreddit.

To the extent that you haven't heard about these, I'd observe that the world already had more apps than you could possibly have ever heard about and the bottleneck was already marketing rather than production. AIs have presumably not successfully done much about helping people market their creations.


Well, the LLM industry is not completely without results. We do have an ever-increasing frequency of outages in major Internet services... Somehow it correlates with the AI mandates major tech corps now seem to be pushing internally.

Disclaimer: I am not promoting llms.

There was a GitHub PR on the OCaml project where someone crafted a long feature (Mac silicon debugging support). The PR was rejected because nobody wanted to read it; it was too long. It seems to me that society is not ready for the volume of output generated this way, which may explain the lack of big visible change so far. But I already see people deploying tiny apps made by Claude in a day.

It's gonna be weird...


As another example, the MacApps Reddit has been flooded with new apps recently.

The effect of these tools is people losing their software jobs (down 35% since 2020). Unemployed devs aren’t clamoring to go use AI on OSS.

Wasn't most of that caused by that one change in 2022 to how R&D expenses are depreciated, thus making R&D expenses (like retaining dev staff) less financially attractive?

Context: This news story https://news.ycombinator.com/item?id=44180533


Yes! Even though it's only a tax rule for the USA, it somehow applied to the whole world! That's how mighty the US is!

Or could it be that, after the growth and build-out, we are in maintenance mode and need fewer people?

Just food for thought


Probably also end of ZIRP and some “AI washing” to give the illusion of progress

Same thing happened to farmers during the industrial revolution, same thing happened to horse-drawn carriage drivers, same thing happened to accountants when Excel came along, and to mathematicians, and on and on the list goes. Just part of human progress.

I keep asking ChatGPT when LLMs will reach 95% software-creation automation; the answer is ten years.

I don't think it will take that long, but yeah, I give it five years.

Two years, and 3/4 of us will not be needed anymore.


I don't have all the variables (financials, OpenAI debt, etc.), but a few articles mention that they delegate part of their work to {claude,gemini,chatgpt} code agents internally with good results. It's a first step in a singularity-like ramp-up.

People think they'll have jobs maintaining AI output, but I don't see how maintaining is that much harder than creating for an LLM able to digest requirements and a codebase and iterate until a working source runs.


I use GitHub Copilot in IntelliJ with Claude Sonnet and the plan mode to implement complete features without me having to code anything.

I see it as a competent software developer but one that doesn't know the code base.

I will break down the tasks to the same size as if I was implementing it. But instead of doing it myself, I roughly describe the task on a technical level (and add relevant classes to the context) and it will ask me clarifying questions. After 2-3 rounds the plan usually looks good and I let it implement the task.

This method works exceptionally well and usually I don't have to change anything.

For me this method allows me to focus on the architecture and overall structure and delegate the plumbing to Copilot.

It is usually faster than if I had implemented it myself, and the code is of good quality.

The game changer for me was plan mode. Before it, agent mode was hit or miss because it forced me to one-shot the prompt or get inaccurate results.


> I see it as a competent software developer but one that doesn't know the code base.

I know what you mean, but the thing I find Windsurf (which we moved to from Copilot) most useful for (except writing OpenAPI spec files) is asking it questions about the codebase. Just random minutiae that I could find by grepping or following the code, but it would take me more than the 30s-1m it takes Windsurf. For reference, this is a monorepo of a bit over 1M LoC (and 800k lines of YAML, because, did I mention I hate API specs?), so not really a small code base either.

> I will break down the tasks to the same size as if I was implementing it. But instead of doing it myself, I roughly describe the task on a technical level (and add relevant classes to the context) and it will ask me clarifying questions. After 2-3 rounds the plan usually looks good and I let it implement the task.

Here I disagree, sort of. I almost never ask it to do complex tasks; the most time-consuming and hardest part is not actually typing out the code, and describing it to an AI takes me almost as much time as implementing it, for most things. One thing I did find very useful is the supertab feature of Windsurf which, at a high level, looks at the changes you started making and starts suggesting the next change. And it's not only limited to repetitive things (like . in vi): if you start adding a parameter to a function, it starts adding it to the docs and to the functions you need below, and starts implementing it.

> For me this method allows me to focus on the architecture and overall structure and delegate the plumbing to Copilot.

Yeah, a coworker said this best, I give it the boring work, I keep the fun stuff for myself.


My experience is that GitHub Copilot works much better in VS Code than in IntelliJ. Now I have to open them together to work on one single project.

Yeah, but what did you produce with it in the end? Show us the end result please.

I cannot show it because the code belongs to my employer.

Ah yes, of course. But no one asked for the code, really. Just show us the app. Or is it some kind of super-duper secret military stuff you are not even supposed to discuss, let alone show?

It is neither of those. It's an application that processes data and is not accessible outside of the company's network. Not everything is an app.

I described my workflow that has been a game changer for me, hoping it might be useful to another person because I have struggled to use LLMs for more than a Google replacement.

As an example, one task of the feature was to add metrics for observability when the new action was executed. Another when it failed.

My prompt: Create a new metric "foo.bar" in MyMetrics when MyService.action was successful and "foo.bar.failed" when it failed.

I review the plan and let it implement it.

As you can see it's a small task and after it is done I review the changes and commit them. Rinse and repeat.

I think the biggest issue is that people try to one-shot big features or applications. But it is much more efficient for me to treat Copilot as a smart pair-programming partner. There you also think about and implement one task after the other.


I've been writing an experimental pipeline-based web app DSL with Claude Code for the last little while in my spare time. Sort of bash-like with middleware for lua, jq, graphql, handlebars, postgres, etc.

Here's an already out of date and unfinished blog post about it: https://williamcotton.com/articles/introducing-web-pipe

Here's a simple todo app: https://github.com/williamcotton/webpipe/blob/webpipe-2.0/to...

Check out the BDD tests in there, I'm quite proud of the grammar.

Here's my blog: https://github.com/williamcotton/williamcotton.com/blob/mast...

It's got an LSP as well with various validators, jump to definitions, code lens and of course syntax highlighting.

I've yet to take screenshots, make animated GIFs of the LSP in action or update the docs, sorry about that!

A good portion of the code has racked up some tech debt, but hey, it's an experiment. I just wanted to write my own DSL for my own blog.


Here's one - https://apps.apple.com/us/app/pistepal/id6754510927

The app is definitely still a bit rough around the edges but it was developed in breakneck speed over the last few months - I've probably seen an overall 5x acceleration over pre-agentic development speed.


I use Junie to get tasks done all the time. For instance I had two navigation bars in an application which had different styling and I told it make the second one look like the first and... it made a really nice patch. Also if I don't understand how to use some open source dependency I check the project out and ask Junie questions about it like "How do I do X?" or "How does setting prop Y have the effect of Z?" and frequently I get the right answer right away. Sometimes I describe a bug in my code and ask if it can figure it out and often it does, ask for a fix and often get great results.

I have a React application where the testing situation is FUBAR. We are stuck on an old version of React where test frameworks like Enzyme that really run React are unworkable, because the framework can never know that React is done rendering -- working with Junie I developed a style of true unit tests for class components (still got 'em) that tests tricky methods in isolation. I have a test file which is well documented, explaining the situation around tests, and I ask "Can we make some tests for A like the tests in B.test.js, how would you do that?" and if I like the plan I say "make it so!" and it does... frankly I would not be writing tests if I didn't have that help. It would also be possible to mock useState() and company, and I might do that someday... It doesn't bother me so much that the tests are too tightly coupled, because I can tell Junie to fix or replace the tests if I run into trouble.

For me the key things are: (1) understanding, from a project management perspective, how to cut out little tasks and questions, (2) understanding enough coding to know if it is on the right track (my non-technical boss has tried vibe coding and gets nowhere), (3) accepting that sometimes it works and sometimes it doesn't, and (4) recognizing context poisoning -- sometimes you ask it to do something and it gets it 95% right, and you can tell it to fix the last bit and it is golden; other times it argues or goes in circles or introduces bugs faster than it fixes them, and you want to recognize that's going on as quickly as you can, start a new session, and mix up your approach.


Manually styling two similar things the same way is a code smell. Ask the AI to make common components and use them for both, instead of brute-forcing them to look similar.

Yeah, I thought about this in that case. I tend to think the way you do to the extent that it is sometimes a source of conflict with other people I work with.

These navbars are similar but not the same, both have a pager but they have other things, like one has some drop downs and the other has a text input. Styled "the same" means the line around the search box looks the same as the lines around the numbers in the pager, and Junie got that immediately.

In the end the patch touched css classes in three lines of one file and added a css rule -- it had the caveat that one of the css classes involved will probably go away when the board finally agrees to make a visual change we've been talking about for most of a year but I left a comment in the first navbar warning about that.

There are plenty of times I ask Junie to try to consolidate multiple components or classes into one and it does that too as directed.



This is a lot of good reasons not to use it yet, IMO.

I know of many experienced and capable engineers working on complex stuff who are driving basically all their development through agents. This includes production level work. This is the norm now in the SV startup world at least.

You don't just YOLO it. You do extensive planning when features are complex, and you review output carefully.

The thing is, if the agent isn't getting it to the point where you feel like you might need to drop down and edit manually, agents are now good enough to do those same "manual edits" with nearly 100% reliability if you are specific enough about what you want to do. Instead of "build me x, y, z", you can tell it to rename variables, restructure functions, write specific tests, move files around, and so on.

So the question isn't so much whether to use an agent or edit code manually—it's what level of detail you work at with the agent. There are still times where it's easier to do things manually, but you never really need to.


Can you show some examples? I feel like there would be streams or YouTube let's-plays of this if it was working well.

I would like to see it as well. It seems to me that everybody sells shovels only, but nobody has seen gold yet. :)

The real secret to agent productivity is letting go of your understanding of the code and trusting the AI to generate the proper thing. Very pro-agent devs like ghuntley will all say this.

And it makes sense. For most coding problems the challenge isn't writing the code. Once you know what to write, typing the code is a drop in the bucket. AI is still very useful, but if you really wanna go fast you have to give up on your understanding. I've yet to see this work well outside of blog posts, tweets, board room discussions, etc.


> The real secret to agent productivity is letting go of your understanding of the code and trusting the AI to generate the proper thing

The few times I've done that, the agent eventually faced a problem/bug it couldn't solve and I had to go and read the entire codebase myself.

Then I found several subtle bugs (like writing private keys to disk even when there was an explicit instruction not to). Eventually I ended up refactoring most of it.

It does have value in coming up with boilerplate code that I then tweak.


You made the mistake of looking at the code, though. If you didn't look at the code, you wouldn't have known those bugs existed.

Fixing code now is orders of magnitude cheaper than fixing it in a month or two when it hits production.

Which might be fine if you're doing proof-of-concept or low-risk code, but it can also bite you hard when there is a bug actively bleeding money and not a single person or AI agent in the house knows how anything works.


That's just irresponsible advice. There is so little actual evidence of this technology being able to produce high quality maintainable code that asking us to trust it blindly is borderline snake-oil peddling.

Not borderline - it is just straight snake-oil peddling.

yet it works? where have you been for the last 2 years?

calling this snake oil is like when the horse carriage riders were against cars.


I have been an early adopter since 2021, buddy. "It works" for trivial use cases; for anything more complex it is utter crap.

> The real secret to agent productivity is letting go of your understanding of the code

This is negligence, it's your job to understand the system you're building.


I don’t see how I would feel comfortable pushing the current output of LLMs into high-stakes production (think SLAs, SRE).

Understanding of the code in these situations is more important than the code/feature existing.


I agree and am the same. Using them to enhance my knowledge, as well as autocomplete on steroids, is the sweet spot. Much easier to review code if I'm "writing" it line by line.

I think the reality is a lot of code out there doesn’t need to be good, so many people benefit from agents etc.


You can use an agent while still understanding the code it generates in detail. In high stakes areas, I go through it line by line and symbol by symbol. And I rarely accept the first attempt. It’s not very different from continually refining your own code until it meets the bar for robustness.

Agents make mistakes which need to be corrected, but they also point out edge cases you haven’t thought of.


Definitely agreed; that is what I do as well. At that point you have a good understanding of that code, which is in contrast to what the post I responded to suggests.

Not to burst your bubble, but I've seen agents expose Stripe credentials by hardcoding them as text into a React frontend app. So no, kids, do not "let go" of code understanding, lest you want to appear as the next story along the lines of "AI dropped my production database".

This is sarcasm right?

I wish. That's dev brain on AI, sadly.

We've been unfucking architecture done like that for a month, after the dev who had a hallucination session with their AI left.


+1 here. Let's see those productivity gains!

A lot of that would be people working on proprietary code I guess. And most of the people I know who are doing this are building stuff, not streaming or making videos. But I'm sure there must be content out there—none of this is a secret. There are probably engineers working on open source stuff with these techniques who are sharing it somewhere.

That's understandable; I also wouldn't stream my next idea for everyone to see.

Let’s see it then

go on reddit and you can see a million of these vibe coded codebases. is that not good enough?

> I add to my team’s CLAUDE.md multiple times a week.

This concerns me because fighting tooling is not a positive thing. It’s very negative and indicates how immature everything is.


The Claude MD is like the documentation you hand to a new engineer on your team that explains details about your code that they wouldn't otherwise know. It's not bad to need one.

But that documentation shouldn’t need to be updated nearly every other day.

Consider that every time you start a session with Claude Code, it's effectively a new engineer. The system doesn't learn like a real person does, so for it to improve over time you need to manually record the insights that a normal human would integrate through the natural learning process.

Yes, that's exactly the problem. There are good reasons why any particular team doesn't onboard new engineers each day, going all the way back to Fred Brooks and "adding more people to a late project makes it later".

Reminds me of that Nicole Kidman movie Before I Go to Sleep.

there are many tools available that work towards solving this problem

Sleep time compute architectures are changing this.

I certainly could be updating the documentation for new devs very frequently - the problem with devs is that they don't bother reading the documentation.

And the other problem: when they see something is wrong or out of date, they don't update it...

If you are consistent with how you do your projects, you shouldn't need to update CLAUDE.md nearly every day. Early on I was adjusting it nearly every day for maybe a couple of projects, but now I have very little need to make any adjustments.

Often the challenge is that users aren't interacting with Claude Code about their rules file. If Claude Code doesn't seem to be working with you, ask it why it ignored a rule. Oftentimes it provides very useful feedback on how to adjust the rules so it no longer violates them.

Another piece of advice I can give is to clear your context window often! Early on I was letting the context window auto-compact, but this is bad! Your model is at its freshest and "smartest" when it has a fresh context window.


It takes a lot of uncached tokens to let it learn about your project again.

Same thing happens every time a new hire joins the team. Lots of documentation is stale and needs updating as they onboard.

> But that documentation shouldn’t need to be updated nearly every other day.

It does if it’s incomplete or otherwise doesn’t accurately convey what people need to know.

And something is terribly wrong if it is constantly in that state despite near daily updates.

Why not?

Have you never looked at your work's Confluence? Worse, have you never spent time at a company where the documentation wasn't frequently updated?

Do you have nothing but onboarding material on yours and somehow still need to update it several times a week?

You might be misunderstanding what a CLAUDE.md is. It's not about fighting the model; rather, it's giving the model a shortcut to the context it needs to do its work. You don't have to have one. Ours is 100% written by Claude itself.

That's not the same thing as adding rules by yourself based on your experiences with Claude.

In addition, having Claude Code's plans and code evaluated is very valid. It leads to calmer decisions by AI agents.

Does the same happen if I create an AGENTS.md instead?

Claude Code does not support AGENTS.md; you can symlink it to CLAUDE.md to work around that. Anthropic: pls support!
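e.g. from the repo root:

  ln -s AGENTS.md CLAUDE.md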


Use AGENTS.md for everything, then put a single line in CLAUDE.md:

  @AGENTS.md

Get a grep!

How do you make Claude Code choose Opus and not Sonnet? For me it seems to pick automatically.

/model

> Use Opus 4.5.

This drives up price faster than quality, though. It also increases latency.


Opus 4.5 is significantly better if you can afford it.

They also recently lowered the price for Opus 4.5, so it is only 1.67x the price of Sonnet, instead of 5x for Opus 4.


There's a counterintuitive pricing aspect of Opus-sized LLMs: they're so much smarter that in some cases they solve the problem faster and with far fewer tokens, so they can end up being cheaper.

Obviously the Anthropic employee advertising their product wants you to pay as much as possible for it.

The generosity of the Max plans indicates otherwise.

God bless these generously benevolent corporations, giving us such amazing services for the low low price of only $200 per month. I'm going to subscribe right now! I almost feel bad, it's like I'm stealing from them.

That $200 a month is getting me $2000 a month in API equivalent tokens.

I used to spend $200+ an hour on a single developer. I'm quite sure benevolence was a factor when they submitted me an invoice, since there was no real transparency about whether I was being overbilled or whether the developer acted in my best interest rather than theirs.

I'll never forget the one contractor who told me he took a whole 40 hours to do something he could have done in less than that, specifically because I had allocated that as an upper-bound weekly budget for him.


> That $200 a month is getting me $2000 a month in API equivalent tokens.

Do you ever feel bad for basically robbing these poor people blind? They're clearly losing so much money by giving you $1800 in FREE tokens every month. Their business can't be profitable like this, but thankfully they're doing it out of the goodness of their hearts.


I'm not sure that you actually expect to be taken seriously if you're going to assert that these companies have no costs of their own to deliver their services.

Even $500 would be cheap, if it can replace one developer.

> 1. If there is anything Claude tends to repeatedly get wrong, not understand, or spend lots of tokens on, put it in your CLAUDE.md.

What a joke. Claude regularly ignores the file. It is a toss-up: we were playing a game at work, guessing which items it would forget first: to run the tests, the formatter, the linter, etc. This is despite items saying ABSOLUTELY MUST, you HAVE TO, and so on.

I have cancelled my Claude Max subscription. At least Codex doesn't tell me that broken tests are unrelated to its changes, or complain that fixing 50 tests is too much work.


Hey Boris, can you teach CC how to use cd?

Personally, CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR=1 made all my cd problems go away (which were only really in cmake-based projects to begin with).
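If you want it to be permanent, I believe the env block in ~/.claude/settings.json applies it to every session:

  {
    "env": {
      "CLAUDE_BASH_MAINTAIN_PROJECT_WORKING_DIR": "1"
    }
  }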

Does all my code get uploaded to the service?

3. Puppeteer? Or Playwright? I haven't been able to make Puppeteer work for the past 8 weeks or so ("failed to reconnect"). Do you have a doc on this?

I know the Playwright MCP server works great. I use it daily.

Same: I use Playwright all the time, but haven't been able to make Puppeteer work in quite some time. Playwright, while reliable in terms of features, just absolutely eats the heck out of context.

I’ve heard folks claim the Chrome DevTools MCP eats less context, but I don’t know how accurate that is.
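If anyone wants to try it, installing it is supposedly something like this (unverified; check the chrome-devtools-mcp README for the current command):

  claude mcp add chrome-devtools -- npx chrome-devtools-mcp@latest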

Hey Boris from the Claude Code team - could you guys please be so kind as to stop pushing that narrative about CLAUDE.md, either yourselves or through influencers and GenAI grifters? The reason being, it is simply not true. A lot of the time the instructions will be ignored. Actually, the term "ignored" is setting the bar too high, because your tool does not intentionally "ignore" anything, not having sentience or knowledge. We experience the effects of the instructions being ignored because your software is not deterministic; it's merely guessing the next token, and sometimes those instructions tacked onto the rest of the context statistically do not match what we as humans expect to see (while being perfectly logical for your machine-learning text generator, based on the datasets it was trained on).

This seems pretty aggressive, considering it's all just personal anecdote.

I update my CLAUDE.md all the time and notice the effects.

Why all the snark?


Is it really just a personal anecdote? Please read some of the other comments on this post. The snark comes from everyone and their mother recommending "just write a CLAUDE.md", when it is clear that this technology does not have the intrinsic capability to produce reliable outputs from human-language input.

Yeah… that’s the point of LLMs: variable output. If you’re using them for 100% consistent output, you’re using the wrong tool.

Is it? So you are saying software should not be consistent? Or that LLMs should not be used for software development, aside from toy projects?

CLAUDE.md is read on session startup.

If you're continually finding that it's being forgotten, maybe you're not starting fresh sessions often enough.


I should not have to fight tooling, especially the supposedly "intelligent" one. What's the point of it, if we have to always adapt to the tool, instead of the other way around?

It's a tool. The first time you used a shell you had to learn it. The first time you used a text editor you had to learn it.

You can learn how to use it, or you can put it down if you think it doesn't bring you any benefit.


even the shell remembers my commands...

I am sorry, but what do I have to learn? That the tool does not work as advertised? That sometimes it will work as advertised and sometimes not? That it will sometimes expose critical secrets as plain text, and some other time suggest solving a problem in a function by removing the function's code completely? What are you even talking about, comparing it to shells and text editors? Those are still bloody deterministic tools. You learn how they work, and the usage does not change unpredictably every day! How can you learn something that does not have predictable outputs?

Yes, you have to learn those things. LLMs are hard to use.

So are animals, but we've used dogs and falcons and truffle hunting pigs as tools for thousands of years.

Non-deterministic tools are still tools, they just take a bunch more work to figure out.


It's like having Michael Jordan with dementia on your team. You start out mesmerized by how many points he can score, and then you get incredibly frustrated that he forgets he has to dribble and shoot into the correct hoop.

Spot on. Not to mention all the fouls and traveling the demented "all star" makes for your team, effectively negating any point gains.

No, please, stop misleading people, Simon. People use tools to make things easier for them, not harder. And a tool which I cannot steer predictably is not a goddamn tool at all! The sheer persistence AI promoters like you are willing to invest just to gaslight us all into thinking we were dumb and did not know how to use the shit generators is really baffling. Understand that a lot of us are early adopters and we see this for what it is: the most serious mess-up of "Big Tech" since Zuckerberg burned $77B on his metaverse idiocy. By the way, animals are not tools. People do not use them; they engage with them as helpers, companions and, for some people, even friends of sorts. Drop your LLM and try engaging with someone who has a hunting dog, for example; they'd be quite surprised if you referred to their beloved retriever as a "tool". And you might learn something about real intelligence.

Your insistence that LLMs are not useful tools is difficult for me to empathize with as someone who has been using them successfully as useful tools for several years - and sharing in great detail how I am using them.

https://simonwillison.net/2025/Dec/10/html-tools/ is the 37th post in my series about this: https://simonwillison.net/series/using-llms/

https://simonwillison.net/2025/Mar/11/using-llms-for-code/ is probably still my most useful of those.

I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.

They're not nearly as unpredictable as you appear to think they are.

One of us is misleading people here, and I don't think it's me.


> One of us is misleading people here, and I don't think it's me.

Firstly, I am not the one with an LLM-influencer side gig. Secondly: no, sorry, please don't move the goalposts. You did not answer my main argument, which is: how does a "tool" which constantly changes its behaviour deserve to be called a tool at all? If a tailor had scissors which sometimes cut the fabric just a bit and sometimes cut completely differently every time they were used, would you tell the tailor he is not using them right too? Thirdly, you are now contradicting yourself. First you said we need to live with the fact that they are unpredictable. Now you are sugarcoating it into "a bit unpredictable", or "not nearly as unpredictable". I am not sure if you are doing this intentionally or if you really want to believe in the "magic", but either way you are ignoring the ground tenets of how this technology works. I'd be fine if they used it to generate cheap holiday novels or erotica - but clearly, four years of experimenting with the crap machines to write code has created a huge pushback in the community. We don't need the proverbial scissors which cut our fabric differently each time!


> how does a "tool" which constantly change its behaviour deserve being called a tool at all?

Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)

The same is true of plenty of other tools - pottery kilns, cast iron pans, knife sharpening stones. Expert tool users frequently use tools that change over time and need to be monitored and adjusted.

I do think dogs, horses, and other working animals remain an excellent example here as well. They're unpredictable and you have to constantly adapt to their latest behaviors.

I agree that LLMs are unpredictable in that they are non-deterministic by nature. I also think that this is something you can learn to account for as you build experience.

I just fed this prompt to Claude Code:

  Add to_text() and to_markdown() features to justhtml.html - for the whole document or for CSS selectors against it
  
  Consult a fresh clone of the justhtml Python library (in /tmp) if you need to
It did exactly what I expected it would do, based on my hundreds of previous similar interventions with that tool: https://github.com/simonw/tools/pull/162

Whether it's blast furnaces or carbon fiber, the wear and tear (macroscopic changes) as well as the material fatigue (molecular changes) are specified by the manufacturer, within some margin of error, and you pretty much know what to expect - unless you are a smartass billionaire building an improvised sub out of carbon fiber whose expiry date was long overdue. However, the carbon fiber or your blast furnace won't break just on their own. So it's a weak analogy, and a stretch at that. Now for your experiment: it has no value, because a) you and I both know that if you told your LLM its output was shit, it would immediately "agree" with you and go off to produce some other crap, and b) for this to be a scientifically valid experiment at all, I'd expect on the order of 10,000 repetitions, each providing exactly the same output. But you and I both know the 2nd iteration will already introduce some changes. So stop fighting the obvious and repeat after me: LLMs are shit for any serious work.

Why would I agree that "LLMs are shit for any serious work" when I've been using them for serious work for two-plus years, as have many other people whose skills I respected from before LLMs came along?

I wrote about another solid case study this morning: https://simonwillison.net/2025/Dec/14/justhtml/

I genuinely don't understand how you can look at all of this evidence and still conclude that they aren't useful for people who learn how to use them.


> Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)

Now let's make the analogy more accurate: let's imagine the blast furnace often ignores the operator controls and just does what it "wants" instead. Additionally, there are no gauges and there is no telemetry you can trust (it might have some, but the furnace will occasionally falsify them, and you won't know when it's doing that).

Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output (requires rework which exceeds the productivity gains of using the blast furnace to begin with).

Furthermore, the only way to tell which of those 3 options you got is to manually inspect every detail of every piece of every output. If you don't do this, the output might leak secrets (or worse) and bankrupt your company.

Finally, the operator would be charged for usage regardless of how often the furnace actually worked. At least this part of the analogy already fits.

What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers. Maybe a few people with money to burn. In particular, those who sing the highest praises of such a tool would likely be ignorant of all these pitfalls, or have a vested interest in the tool selling.


You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention paid to catching their mistakes and designing systems around them (like sandboxed coding agent harnesses) that allow them to be operated productively and safely.

I couldn't agree more.


> You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention

I did not say that. I said that most metalworkers familiar with all the downsides (only one of which you are referring to here) would avoid using such an unpredictable, uncontrollable, uneconomical blast furnace entirely.

A regular blast furnace requires the user to be careful. A blast furnace which randomly does whatever it wants from minute to minute, producing bad output more often than good, including bad output that costs more to fix than the furnace cost to run and more than any cost savings, with no way to tell or meaningfully control it, is pretty useless.

Saying "be careful" using a machine with no effective observability or predictability or controls is a silly misnomer, when no amount of care will bestow the machine with them.

What other tools work this way, and are in widespread use? You mentioned horses, for example: What do you think usually happens to a deranged, rabid, syphilitic working horse which cannot effectively perform any job with any degree of reliability, and which often unpredictably acts out in dangerous and damaging ways? Is it usually kept on the job and 'run carefully'? Of course not.


> I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.

Wow, was that a shark just then?


> So are animals, but we've used dogs and falcons and truffle hunting pigs as tools for thousands of years.

Dogs learn their jobs way faster, more consistently and more expressively than any AI tool.

Trivially, dogs understand "good dog" and "bad dog" for example.

Reinforcement learning with AI tooling clearly seems not to work.


> Dogs learn their jobs way faster, more consistently and more expressively than any AI tool.

That doesn't match my experience with dogs or LLMs at all.


Ever heard of service dogs? Or police dogs? Now tell me, when will LLMs ever be safe to use as assistance for blind people? Or will big tech at some point release some sloppy blind-people tool based on LLMs and unleash the LLM influencers like yourself to start gaslighting the users into thinking they were "not holding it right"? For mission- and life-critical problems, I'll take a dog any day, thank you very much!

I've talked to a few people who are blind about vision LLMs and they're very, very positive about them.

They fully understand their limitations. Users of accessibility technology are extremely good at understanding the precise capabilities of the tools they use - which reminds me that screenreaders themselves are a great example of unreliable tools due to the shockingly bad web apps that exist today.

I've also discussed the analogy to service dogs with them, which they found very apt given how easily their assistive tool could be distracted by a nearby steak.

The one thing people who use assistive technology do not appreciate is being told that they shouldn't try a technology out themselves because it's unreliable and hence unsafe for them to use!


Please, for once, answer the question being asked without replacing both the question and the stated intention with something else. I was willing to give you the benefit of the doubt, but I am now really wondering where your motivation for these vaguely constructed "analogies" is coming from. Is the LLM industry that desperate? We were all "positive" about LLM possibilities once. I am asking you: when will LLMs be so reliable that they can be used in place of service dogs for blind people? Do you believe that this technology will ever be that safe? Have you ever actually seen a service dog? I don't think you can distract a service dog with a steak. Did you know they start their training basically from year one of age, and it takes up to two years to train them? Do you think they spend those two years learning to fetch properly? Also, I never said people should not be allowed to "try" a technology. But like with drugs, tools for the impaired, the sick, etc. also undergo a verification and licensing process; I am surprised you did not know that. So I am asking you again: can you ever imagine an LLM passing those high regulatory hurdles, so that it can be safely used for assisting impaired people? Service dogs must be doing something right, if so many of them are safely assisting so many people today, don't they?

You’ve asked the right questions and don’t want to find the answers. It’s on you.

I understand you're trying to be helpful but the number of "you're holding it wrong" things I read about this tool — any AI tool — just makes me wonder who vibe coders are really doing all this unpaid work for.


