(but honestly, for a lot of websites and web apps you really can just send it; the stakes are low for most of what people build, if they're honest with themselves)
I find this absolutely wild. In my experience Codex's code quality is still not as good as a human's, so letting Codex do something and not verifying or cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.
For upgrading frameworks and the like, there usually aren't many architectural decisions to make where you care how exactly something is implemented. Here the OP could probably verify quite easily that the build works and produces all the expected artifacts.
I have a `codex-review` skill with a shell script that invokes the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will sometimes go through 3 or 4 back-and-forth iterations before they reach consensus. It's not perfect, but it does help, because Claude will point out the things Codex found and give it credit.
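For anyone curious what a script like that might look like, here's a minimal sketch. It assumes the Codex CLI's non-interactive `codex exec` subcommand and a git working tree; the prompt wording and the diff scoping are illustrative, not the actual skill:

```sh
#!/usr/bin/env sh
# Minimal sketch of a review script a skill like this might wrap.
# Assumes the Codex CLI's non-interactive `codex exec` subcommand;
# adjust the invocation to match your installed version.

# Collect the uncommitted changes to review (illustrative scope).
DIFF=$(git diff HEAD)

if [ -z "$DIFF" ]; then
  echo "Nothing to review." >&2
  exit 0
fi

# Ask Codex to act as a review partner and push back where it disagrees,
# rather than rubber-stamping whatever Claude produced.
codex exec "You are reviewing a teammate's change. Point out bugs, risky
patterns, and places where you disagree with the approach. Be direct and
push back rather than approving by default.

$DIFF"
```

Claude then reads Codex's output, responds to the points it disagrees with, and re-runs the script on the revised diff until the two converge.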
Sure, but it's still a useful signal to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.