I do something similar, but across three doc types: design, plan, and debug
Design works similarly to your project.md file, but on a per-feature-request basis. I also explicitly ask it to outline open questions/unknowns.
Once the design doc (i.e. design/[feature].md) has been sufficiently iterated on, we move to the plan doc(s).
The plan docs are structured like `plan/[feature]/phase-N-[description].md`
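Concretely, for a hypothetical feature the docs end up laid out something like:

```
design/
  dark-mode.md
plan/
  dark-mode/
    phase-1-data-model.md
    phase-2-ui.md
```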
From here, the agent iterates until the plan is "done", only stopping if it encounters some build/install/run limitation.
At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
We review these hypotheses, sometimes iterate, and then tackle them one by one.
An important note for debug flows: as with manual debugging, it's often better to have the agent instrument logging/traces/etc. to confirm a hypothesis before moving directly to a fix.
Using this method has led to a 100% vibe-coded success rate on both greenfield and legacy projects.
Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to automating this yet (or needed to), as sometimes these historic planning/debug files are useful for future changes.
My "heavy" workflow for large changes is basically as follows:
0. Create a .gitignored directory where agents can keep docs. Every project deserves one of these, not just for LLMs, but also for logs, random JSON responses you captured to a file, etc.
1. Ask the agent to create a file for the change and rephrase the prompt in its own words. My prompts are super sloppy, full of typos, with 0 emphasis put on good grammar, so it's a good first step to make sure the agent understands what I want it to do. It also helps preserve the prompt across sessions.
2. Ask the agent to do research on the relevant subsystems and dump it to the change doc. This is to confirm that the agent correctly understands what the code is doing and isn't missing any assumptions. If something goes wrong here, it's a good opportunity to refactor or add comments to make future mistakes less likely.
3. Spec out behavior (UI, CLI, etc.). The agent is allowed to ask for decisions here.
4. Given the functional spec, figure out the technical architecture, same workflow as above.
5. High-level plan.
6. Detailed plan for the first incomplete high-level step.
7. Implement, manually review code until satisfied.
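Put together, the change doc ends up growing into something like this (section names are just illustrative, not a strict template):

```markdown
# Change: <short name>

## Prompt (restated by the agent)
## Research notes on the relevant subsystems
## Functional spec (UI, CLI, open decisions)
## Technical architecture
## High-level plan
## Detailed plan for the current step
## Implementation / review notes
```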
> At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
I'm biased because my company makes a durable execution library, but I'm super excited about the debug workflow we recently enabled when we launched both a skill and MCP server.
You can use the skill to tell your agent to build with durable execution (and it does a pretty great job the first time in most cases) and then you can use the MCP server to say things like "look at the failed workflows and find the bug". And since it has actual checkpoints from production runs, it can zero in on the bug a lot quicker.
This is great; giving agents access to logs (dev or prod) tightens the debug flow substantially.
With that said, I often find myself leaning on the debug flow for non-errors, e.g. UI/UX regressions that the models are still bad at visualizing.
As an example, I added a "SlopGoo" component to a side project, which uses an animated SVG to produce a "goo" like effect. Ended up going through 8 debug docs[0] until I was satisfied.
> giving agents access to logs (dev or prod) tightens the debug flow substantially.
Unless the agent doesn't know what it's doing... I've caught Gemini stuck in an edit-debug loop making the same 3-4 mistakes over and over again for like an hour, only to take the code over to Claude and get the correct result in 2-3 cycles (like 5-10 minutes)... I can't really blame Gemini for that too much though, what I have it working on isn't documented very well, which is why I wanted the help in the first place...
I have a similar process and have thought about committing all the planning files, but I've found that they tend to end up in an outdated state by the time the implementation is done.
Better imo is to produce a README or dev-facing doc at the end that distills all the planning and implementation into a final authoritative overview. This is easier for both humans and agents to digest than a bunch of meandering planning files.
> Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to automating this yet (or needed to), as sometimes these historic planning/debug files are useful for future changes.
FWIW, what you describe maps well to Beads. Your directory structure becomes dependencies between issues, and/or parent/child issue relationships, and/or labels ("epic", "feature", "bug", etc). Your markdown moves from files to issue entries hidden away in a JSONL file with a local DB as cache.
Your current file-system "UI" vs Beads command line UI is obviously a big difference.
Beads provides a kind of conceptual bottleneck, which I think helps when using it with LLMs. Beads is more self-documenting, while a file system can be "anything".
I've been experimenting with a few ways to keep the "historical context" of the codebase relevant to future agent sessions.
First, I tried using simple inline comments, but the agents happily (and silently) removed them, even when prompted not to.
The next attempt was to have a parallel markdown file for every code file. This worked OK, but suffered from a few issues:
1. Understanding context beyond the current session
2. Tracking related files/invocations
3. Cold-start problem on existing codebases
To solve 1 and 3, I built a simple "doc agent" that does a poor man's tree traversal of the codebase, noting any unknowns/TODOs, and running until "done."
To solve 2, I explored using the AST directly, but this made the human aspect of the codebase even less pronounced (not to mention a variety of complex edge-cases), and I found the "doc agent" approach good enough for outlining related files/uses.
To improve the "doc agent" cold start flow, I also added a folder level spec/markdown file, which in retrospect seems obvious.
The main benefit of this system is that when the agent is working, it not only has to change the source code, but also has to reckon with the explanation/rationale behind said source code. I haven't done any rigorous testing, but in my anecdotal experience, the models make fewer mistakes and cause fewer regressions overall.
I'm currently toying around with a more formal way to mark something as a human decision vs. an agent decision (i.e. this is very important vs. this was just the path of least resistance); however, the current approach seems to work well enough.
If anyone is curious what this looks like, I ran the cold start on OpenAI's Codex repo[0].
In the context of traditional SaaS, use dynamic secrets loaded at runtime (KMS + Dynamo, etc.).
For agentic tools and pure agents, a proxy is the safest approach. The agent can even think it has a real API key, but said key is worthless outside of the proxy setting.
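As a toy sketch of that proxy idea (upstream host, port, and env var names are all made up), something like this sits between the agent and the real API:

```js
// The agent is handed a dummy key and pointed at this proxy; the proxy drops
// whatever key the agent sent and injects the real one before forwarding.
import http from "node:http";
import https from "node:https";

const REAL_KEY = process.env.UPSTREAM_API_KEY; // the real secret never reaches the agent

http.createServer((req, res) => {
  const upstream = https.request(
    {
      hostname: "api.example.com", // illustrative upstream
      path: req.url,
      method: req.method,
      headers: {
        ...req.headers,
        host: "api.example.com",
        authorization: `Bearer ${REAL_KEY}`, // the agent's dummy key is replaced here
      },
    },
    (up) => {
      res.writeHead(up.statusCode || 502, up.headers);
      up.pipe(res);
    }
  );
  req.pipe(upstream);
}).listen(8080);
```

Outside of the proxy, the key the agent holds is worth nothing.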
It surprises me how often I see some Dockerfile, Helm, Kubernetes, Ansible, etc. write .env files to disk in some production-like environment.
The OS, especially Linux (the most common one for hosting production software), is perfectly capable of setting and providing ENV vars. Almost all common devops and older sysadmin tooling can set ENV vars. There's really no need to ever write these to disk.
I think this comes from unaware developers who think a .env file, plus runtime logic in the app that reads this file (dotenv libs), is required for this to work.
I certainly see this misconception a lot with (junior) developers working on Windows.
- You don't need dotenv libraries searching for files, parsing them, etc. in your app's runtime. Just leave it to the OS to provide the ENV vars and read those in your app (see the sketch after this list).
- Yes, also on your development machine. Plenty of tools, from direnv to the bazillion "dotenv" runners, will do this for you. But even those aren't required; you could just set env vars in .bashrc, /etc/environment (don't put them there, though), etc.
- Yes, even on Windows there are plenty of options, even when developers refuse to or cannot use WSL. Various tools, but in the end, just `set foo=bar`.
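A minimal sketch of that first point, assuming Node and made-up variable names:

```js
// Read config straight from the environment the OS/orchestrator already set.
// No dotenv, no file parsing. DATABASE_URL and API_KEY are just example names.
const dbUrl = process.env.DATABASE_URL;
const apiKey = process.env.API_KEY;

if (!dbUrl || !apiKey) {
  throw new Error("missing required environment variables");
}
```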
Environment variables are, by far, the most secure AND most practical way to provide configuration and secrets to apps.
Any other way is less secure (files on disk, CLI arguments, a database, etc.), or about as secure but far more complex and convoluted. I've seen enterprise hosting with a (virtual) mount (NFS, etc.) that provides config files, read-only, with tight permissions, served from a secure vault. A lot of indirection for getting secrets into an app that will still just read them in plain text. More secure than env vars? How?
Or some encrypted database/vault that the app can read from using... a shared secret provided as an env var or an on-disk config file.
Disagree, the best way to pass secrets is by using mount namespaces (systemd and docker do this under /run/secrets/) so that the program can access the secrets as needed but they don't exist in the environment. The process is not complicated; many systems already implement it. By keeping them out of ENV variables you no longer have to worry about the entire ENV getting written out during a crash or debugging and exposing the secrets.
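For reference, consuming such a file-based secret is about one line (the path and secret name here are illustrative; Docker mounts secrets read-only under /run/secrets/<name>):

```js
// Read a secret mounted into the filesystem at runtime instead of from the environment.
import { readFileSync } from "node:fs";

const dbPassword = readFileSync("/run/secrets/db_password", "utf8").trim();
```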
How does a mounted secret (vault) protect against dumping secrets on crash or debugging?
The app still has it. It can dump it. It will dump it. Django, for example (not a security best practice in itself, btw), will indeed dump ENV vars but will also dump its settings.
The solution to this problem lies not in how you get the secrets into the app, but in prohibiting them getting out of it.
E.g. builds that remove/stub tracing and dumping entirely, or proper logging and tracing layers that filter stuff.
There really is no difference, security-wise, between logger.debug(system.env) and logger.debug(app.conf).
I’ve found that LLMs seem to work better on LLM-generated codebases.
Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.
As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.
We call this tech debt.
Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.
With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).
Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.
YMMV here, as it’s hard to do large refactors without tests for correctness.
(Note: I've dabbled with a test-first refactor approach, but haven't gone far enough to claim it works, though I believe it could.)
Claude by default, unless I tell it not to, will write stuff like:
```js
// we need something to be true
const somethingPasses = something()
if (!somethingPasses) {
  return false
}

// we need somethingElse to be true
const somethingElsePasses = somethingElse()
if (!somethingElsePasses) {
  return false
}

return true
```
instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.
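i.e. the whole thing could just be:

```js
return something() && somethingElse()
```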
generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.
to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.
Yeah to be clear it will have the same issues as a flyby contributor if prompted to.
Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.
I’ve found the most success in having it reason about the current architecture (explicitly), then propose a set of changes to accomplish the task (2-5 ways), review them, and then implement the changes that best suit the scope of the larger system.
The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.
Right. Each context window is a partial view, so it cannot “know the codebase” unless you supply stable artefacts. Treat project state as inputs: invariants, interfaces, constraints, and a small set of must-keep facts. Then force changes through a plan and a diff, and gate with tests and checks. That turns context limits into a controlled boundary instead of a surprise.
I’ve been “testing” LLM willingness to explore novel ideas/hypotheses for a few random topics[0].
The earlier LLMs were interesting, in that their sycophantic nature eagerly agreed, often lacking criticality.
After reducing said sycophancy, I’ve found that certain LLMs are much more unwilling (especially the reasoning models) to move past the “known” science[1].
I’m curious to see how/if we can strike the right balance with an LLM focused on scientific exploration.
[0]Sediment lubrication due to organic material in specific subduction zones, potential algorithmic basis for colony collapse disorder, potential to evolve anthropomorphic kiwis, etc.
[1]Caveat, it’s very easy for me to tell when an LLM is “off-the-rails” on a topic I know a lot about, much less so, and much more dangerous, for these “tests” where I’m certainly no expert.
1,500 tool calls per task sounds like a nightmare for unit economics though. I've been optimizing my own agent workflows and even a few dozen steps makes it hard to keep margins positive, so I'm not sure how this is viable for anyone not burning VC cash.
True, but that's still 1,500 inference cycles. Even without external API fees, the latency and compute burden seems huge. I don't see how the economics work there without significant subsidies.
An LLM only outputs tokens, so this could be seen as an extension of tool calling, where it has trained on the knowledge and use cases for "tool-calling" itself as a sub-agent.
Sort of. It’s not necessarily a single call. In the general case it would be spinning up a long-running agent with various kinds of configuration — prompts, but also coding environment and which tools are available to it — like subagents in Claude Code.
> People have said that software engineering at large tech companies resembles "plumbing"
> AI code [..] may also free up a space for engineers seeking to restore a genuine sense of craft and creative expression
This resonates with me, as someone who joined the industry circa 2013, and discovered that most of the big tech jobs were essentially glorified plumbers.
In the 2000s, the web felt more fun, more unique, more unhinged. Websites were simple, and Flash was rampant, but it felt like the ratio of creators to consumers was higher than now.
With Claude Code/Codex, I've built a bunch of things that usually would die at a domain name purchase or init commit. Now I actually have the bandwidth to ship them!
This ease of dev also means we'll see an explosion in slopware, which we're already starting to see with App Store submissions up 60% over the last year[0].
My hope is that, with the increase of slop, we'll also see an increase in craft. Even if the proportion drops, the scale should make up for it.
We sit in prefab homes, cherishing the cathedrals of yesteryear, often forgetting that we've built skyscrapers the ancient architects could never dream of.
More software is good. Computers finally work the way we always expected them to!
> joined the industry circa 2013, and discovered that most of the big tech jobs were essentially glorified plumbers
Most tech jobs are glorified plumbers. I've worked in big tech and in small startups, and most of the code everywhere is unglamorous, boring, just needs to be written.
Satisfaction with the job also depends on what you want out of it. I know people who love building big data pipelines, and people who love building fancy UIs. Those two groups would find the other's job incredibly tedious.
The right job for a person depends on whether they can rise above the specific flavor of pain that the job dishes out. BigTech jobs strike me as having an inextricable political element to them: so you enjoy jockeying for titles and navigating constant reorgs?
The pay is nice but I find myself…remarkably unenvious as I get older.
Big companies are political and re-orgs lead to layoffs. Startups are a constant battle for funding and go out of business. Small companies mean a lot of exposure to bad management and budget issues. Charities are highly regulated and audited environments. Government jobs have no perks and entrenched middle management.
Every type of work has its idiosyncrasies, which people will either get on with or not. Mentioning one without the others is a bit disingenuous, or it's whatever the opposite of the grass-is-greener bias is.
Plumbing has certification and industry best practices, and its leaks generally affect a few blocks at most rather than spraying across the entire internet.
Er... we sit in prefab homes? Trailers are generally considered to be the worst possible quality of home construction and actually lose value instead of the normal appreciation real estate has.
One thing that surprised me when diving into the Codex internals was that the reasoning tokens persist during the agent tool call loop, but are discarded after every user turn.
This helps preserve context over many turns, but it can also mean some context is lost between two related user turns.
A strategy that's helped me here is having the model write progress updates (along with general plans/specs/debug/etc.) to markdown files, acting as a sort of "snapshot" that works across many context windows.
I'm pretty sure that Codex uses reasoning.encrypted_content=true and store=false with the responses API.
reasoning.encrypted_content=true - The server will return all the reasoning tokens in an encrypted blob you can pass along in the next call. Only OpenAI can decrypt them.
store=false - The server will not persist anything about the conversation on the server. Any subsequent calls must provide all context.
Combined, the two options above turn the Responses API into a stateless one. Without these options it will still persist reasoning tokens in an agentic loop, but it will be done statefully, without the client passing the reasoning along each time.
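As a rough sketch of what that looks like against the public Responses API (this is my assumption of how Codex drives it, not its actual code):

```js
// A stateless Responses API call: nothing is stored server-side, and reasoning
// comes back as an encrypted item the client passes along in the next request.
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-5",                           // model name is illustrative
  input: fullConversationSoFar,             // hypothetical variable: the entire prior context, resent every call
  store: false,                             // don't persist the conversation on the server
  include: ["reasoning.encrypted_content"], // return reasoning as encrypted blobs only OpenAI can decrypt
});
```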
Maybe it's changed, but this is certainly how it was back in November.
I would see my context window jump in size after each user turn (e.g. from 70% to 85% remaining).
Built a tool to analyze the requests, and sure enough the reasoning tokens were removed from past responses (but only between user turns). Here are the two relevant PRs [0][1].
When trying to get to the bottom of it, someone from OAI reached out and said this was expected and a limitation of the Responses API (interesting sidenote: Codex uses the Responses API, but passes the full context with every request).
This is the relevant part of the docs[2]:
> In turn 2, any reasoning items from turn 1 are ignored and removed, since the model does not reuse reasoning items from previous turns.
Thanks. That's really interesting. That documentation certainly does say that reasoning from previous turns are dropped (a turn being an agentic loop between user messages), even if you include the encrypted content for them in the API calls.
I wonder why the second PR you linked was made then. Maybe the documentation is outdated? Or maybe it's just to let the server be in complete control of what gets dropped and when, like it is when you are using responses statefully? This can be because it has changed or they may want to change it in the future. Also, codex uses a different endpoint than the API, so maybe there are some other differences?
Also, this would mean that the tail of the KV cache that contains each new turn must be thrown away when the next turn starts. But I guess that's not a very big deal, as it only happens once for each turn.
> And here’s where reasoning models really shine: Responses preserves the model’s reasoning state across those turns. In Chat Completions, reasoning is dropped between calls, like the detective forgetting the clues every time they leave the room. Responses keeps the notebook open; step‑by‑step thought processes actually survive into the next turn. That shows up in benchmarks (TAUBench +5%) and in more efficient cache utilization and latency.
I think the delta may be an overloaded use of "turn"? The Responses API does preserve reasoning across multiple "agent turns", but doesn't appear to across multiple "user turns" (as of November, at least).
In either case, the lack of clarity on the Responses API inner-workings isn't great. As a developer, I send all the encrypted reasoning items with the Responses API, and expect them to still matter, not get silently discarded[0]:
> you can choose to include reasoning 1 + 2 + 3 in this request for ease, but we will ignore them and these tokens will not be sent to the model.
> I think the delta may be an overloaded use of "turn"? The Responses API does preserve reasoning across multiple "agent turns", but doesn't appear to across multiple "user turns" (as of November, at least).
It depends on the API path. Chat completions does what you describe; however, isn't it legacy?
I've only used Codex with the responses v1 API, and there it's the complete opposite. Already-generated reasoning tokens even persist when you send another message (without rolling back) after cancelling turns before they have finished the thought process.
Also, with responses v1, xhigh mode eats through the context window multiple times faster than the other modes, which does check out with this.
That’s what I used to think, before chatting with the OAI team.
The docs are a bit misleading/opaque, but essentially reasoning persists for multiple sequential assistant turns, but is discarded upon the next user turn[0].
The diagram on that page makes it pretty clear, as does the section on caching.
I think it might be a good decision though, as it might keep the context aligned with what the user sees.
If the reasoning tokens were persisted, I imagine it would be possible to build up more and more context that's invisible to the user, and in the worst case, the model's and the user's "understanding" of the chat might diverge.
E.g. imagine a chat where the user just wants to make some small changes. The model asks whether it should also add test cases. The user declines and tells the model to not ask about it again.
The user asks for some more changes; however, invisibly to the user, the model keeps "thinking" about test cases, never mentioning them outside of reasoning blocks.
So suddenly, from the model's perspective, a lot of the context is about test cases, while from the user's POV, it was only one irrelevant question at the beginning.
This is effective and it's convenient to have all that stuff co-located with the code, but I've found it causes problems in team environments or really anywhere where you want to be able to work on multiple branches concurrently. I haven't come up with a good answer yet but I think my next experiment is to offload that stuff to a daemon with external storage, and then have a CLI client that the agent (or a human) can drive to talk to it.
Worktrees are good but they solve a different problem. The question is: if you have a lot of agent config specific to your work on a project, where do you put it? I'm coming around to the idea that checking it in causes enough problems that it's worth the pain to put it somewhere else.
## Task Management
- Use the projects directory for tracking state
- For code review tasks, do not create a new project
- Within the `open` subdirectory, make a new folder for your project
- Record the status of your work and any remaining work items in a `STATUS.md` file
- Record any important information to remember in `NOTES.md`
- Include links to MRs in NOTES.md.
- Make a `worktrees` subdirectory within your project. When modifying a repo, use a `git worktree` within your project's folder. Skip worktrees for read-only tasks
- Once a project is completed, you may delete all worktrees along with the worktrees subdirectory, and move the project folder to `completed` under a quarter-based time hierarchy, e.g. `completed/YYYY-Qn/project-name`.
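For a concrete picture, that produces a layout roughly like this (project names invented):

```
projects/
  open/
    some-feature/
      STATUS.md
      NOTES.md
      worktrees/
  completed/
    2025-Q3/
      finished-feature/
```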
More stuff, but that's the basics of folder management, though I haven't hooked it up to our CI to deal with MRs etc, and have never told it that a project is done, so haven't ironed out whether that part of the workflow works well. But it does a good job of taking notes, using project-based state directories for planning, etc. Usually it obeys the worktree thing, but sometimes it forgets after compaction.
I'm dumb with this stuff, but what I've done is set up a folder structure:
And then in dev/AGENTS.md, I say to look at ai-workflows/AGENTS.md, and that's our team sharable instructions (e.g. everything I had above), skills, etc. Then I run it from `dev` so it has access to all repos at once and can make worktrees as needed without asking. In theory, we all should push our project notes so it can have a history of what changed when, etc. In practice, I also haven't been pushing my project directories because they have a lot of experimentation that might just end up as noise.
worktrees are a bunch of extra effort. if your code's well segregated, and you have the right config, you can run multiple agents in the same copy of the repo at the same time, so long as they're working on sufficiently different tasks.
I do this sometimes - let Claude Code implement three or four features or fixes at the same time on the same repository directory, no worktrees. Each session knows which files it created, so when you ask CC to commit the changes it made in this session, it can differentiate them. Sometimes it will think the other changes are temporary artifacts or results of an experiment and try to clear them (especially when your CLAUDE.md contains an instruction to make it clean after itself), so you need to watch out for that. If multiple features touch the same file and different hunks belong to different commits, that's where I step in and manually coordinate.
I'm insane and run sessions in parallel. Claude.md has Claude committing to git just the changes that session made, which lets me pull each session's changes into their own separate branch for review without too much trouble.
that's the main gripe I have with codex; I want better observability into what the AI is doing to stop it if I see it going down the wrong path. in CC I can see it easily and stop and steer the model. in codex, the model spends 20m only for it to do something I didn't agree on. it burns OpenAI tokens too; they could save money by supporting this feature!
I’ve been using agent-shell in emacs a lot and it stores transcripts of the entire interaction. It’s helped me out a lot of times because I can say ‘look at the last transcript here’.
It’s not the agent’s responsibility to write this transcript; it’s emacs’s, so I don’t have to worry about the agent forgetting to log something. It’s just writing the buffer to disk.
Same here! I think it would be good if the tooling did this by default. I've seen others using SQL for the same, and even a proposal for a succinct way of representing this handoff data in the most compact form.
That could explain the "churn" when it gets stuck. Do you think it needs to maintain an internal state over time to keep track of longer threads, or are written notes enough to bridge the gap?
Sonnet has the same behavior: it drops thinking on a user message. Curiously, in the latest Opus they have removed this behavior and all thinking tokens are preserved.
but that's why I like Codex CLI, it's so bare-bones and lightweight that I can build lots of tools on top of it. persistent thinking tokens? let me have that using a separate file the AI writes to. the reasoning tokens we see aren't the actual tokens anyway; the model does a lot more behind the scenes but the API keeps them hidden (all providers do that).
Codex is wicked efficient with context windows, with the tradeoff of time spent. It hurts the flow state, but overall I've found that it's the best at having long conversations/coding sessions.
It's worth it at the end of the day because it tends to properly scope out changes and generate complete edits, whereas I always have to bring Opus around to fix things it didn't fix or manually loop in some piece of context that it didn't find before.
That said, faster inference can't come soon enough.
> That said, faster inference can't come soon enough.
why is that? technical limits? I know cerebras struggles with compute and they stopped their coding plan (sold out!). their arch also hasn't been used with large models like gpt-5.2. the largest they support (if not quantized) is glm 4.7 which is <500B params.
where do you save the progress updates? and do you delete them afterwards, or do you have like 100+ progress updates each time you have claude or codex implement a feature or change?