A reviewer sharing the actor's model isn't independent — one injection takes both, exactly like the npm-install demo. What held for me was a deterministic allowlist no prompt talks past.
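A minimal sketch of what a deterministic allowlist could look like, assuming tool calls arrive as a name plus keyword arguments. The tool names and the `ToolCall` shape are hypothetical and not taken from any specific framework; the point is only that the verdict is a pure lookup no prompt text can influence.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> set of permitted argument keys.
ALLOWED_TOOLS = {
    "read_file": {"path"},
    "run_tests": {"target"},
}


@dataclass
class ToolCall:
    name: str
    args: dict


def is_allowed(call: ToolCall) -> bool:
    """Pure check: depends only on the call and the static table."""
    allowed_args = ALLOWED_TOOLS.get(call.name)
    if allowed_args is None:
        return False  # unknown tool: deny by default
    # Deny calls that carry argument keys outside the permitted set.
    return set(call.args) <= allowed_args
```

Anything not listed is denied, so an injected "please also run npm install" becomes a rejected call rather than a judgment the model gets to make.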
Reconciling intent has a bootstrap problem: it's inferred from the same model you're constraining, so it rationalizes. Side-effect gates — spend, irreversible writes — can't be talked around.
Inspectable state shows what the agent believed, not why it diverged. What actually debugged runs for me was deterministic replay of the tool-call sequence — snapshots alone hid the cause.
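One way deterministic replay of a tool-call sequence could be sketched, assuming every tool call goes through a single `call` chokepoint. `Recorder`, `Replayer`, and `Divergence` are illustrative names, not an existing library: the recorder logs each call and result, and the replayer serves the recorded results back in order and fails at the first step where the agent asks for something different.

```python
class Divergence(Exception):
    """Raised when a replayed run departs from the recorded sequence."""


class Recorder:
    def __init__(self, tools):
        self.tools = tools  # mapping: tool name -> callable
        self.log = []

    def call(self, name, **args):
        result = self.tools[name](**args)
        self.log.append({"name": name, "args": args, "result": result})
        return result


class Replayer:
    def __init__(self, log):
        self.log = log
        self.pos = 0

    def call(self, name, **args):
        if self.pos >= len(self.log):
            raise Divergence(f"step {self.pos}: extra call to {name}")
        expected = self.log[self.pos]
        if (expected["name"], expected["args"]) != (name, args):
            raise Divergence(
                f"step {self.pos}: expected {expected['name']}"
                f"({expected['args']}), got {name}({args})"
            )
        self.pos += 1
        # Return the recorded result instead of touching the real tool.
        return expected["result"]
```

The step index in the `Divergence` message is the useful part: it points at the first call that differs, which a state snapshot taken after the fact does not.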
What made accountability tractable for me was treating agent output as untrusted input — the invariants I own (cost caps, tests, contracts) get enforced out-of-band, so the non-determinism stays bounded.
Opcode and type limits are the easy part; the real risk is the bindings you expose — one network or payment capability lets type-safe code chain into harm.
This language provides isolation at the language level and trusts the code written by the library developer. If it is absolutely necessary, I think environment isolation should still be used. What do you think of this approach?
Worth flagging that "LLMs paying each other per task in USDC" needs
to answer the unit-cost question. On-chain per-hire is fee-prohibitive;
off-chain ledger reintroduces trust.
Good question, I'll try to answer. Base is an L2 blockchain, so the gas is really low ($0.002). You can see all the transactions from the tournament; there are 298. Based on this data, I can say that the real bottleneck isn't the gas fee, it's the inference!
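To put the two quoted figures together, here is a back-of-envelope total. It assumes the $0.002 gas figure applies uniformly to each of the 298 transactions, which is a simplification; real gas varies per transaction.

```python
from decimal import Decimal

# Figures quoted in the comment; the uniform per-transaction cost is an assumption.
GAS_PER_TX_USD = Decimal("0.002")
TX_COUNT = 298

total_gas_usd = GAS_PER_TX_USD * TX_COUNT
print(total_gas_usd)  # 0.596
```

Under that assumption the whole tournament costs well under a dollar in gas.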
Forgot to mention: the facilitator pays the gas using EIP-3009. The result is that the USDC goes directly from buyer to seller.
Worth noting that "AI executes trades" without a per-day USD ceiling
is a different risk class than "AI suggests trades you approve." Most
agent-trading tools shipped without that ceiling as default.
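A rough sketch of what a per-day USD ceiling might look like as a gate that sits outside the model. `DailySpendGate` and its parameters are hypothetical, and the caller is assumed to pass the current date so the check stays deterministic.

```python
from datetime import date
from decimal import Decimal


class DailySpendGate:
    """Refuses any trade that would push the day's total past the ceiling."""

    def __init__(self, ceiling_usd: Decimal):
        self.ceiling = ceiling_usd
        self.day = None
        self.spent = Decimal("0")

    def authorize(self, amount_usd: Decimal, today: date) -> bool:
        if today != self.day:
            # New day: reset the running total.
            self.day, self.spent = today, Decimal("0")
        if amount_usd <= 0 or self.spent + amount_usd > self.ceiling:
            return False
        self.spent += amount_usd
        return True
```

A rejected trade leaves the running total unchanged, so a smaller trade later the same day can still go through if it fits under the ceiling.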
Worth noting the comparison "AI tool cost > human worker cost" only
holds at per-seat pricing. Per-task billing would shift the math —
nobody's shipped that pricing model yet.
Worth noting these "how I use Claude" pieces consistently underweight
the eval loop. Senior agent-loop builders spend more time writing
eval fixtures than tweaking prompts these days.
Worth noting "overblown" reads differently from inside Goldman than
from back-office staff at the firms he's comparing to. Junior analyst
displacement is the actual story being skipped.