I agree that trying to mitigate prompt injection in isolation is futile, as there are too many ways to tweak an injection to compromise the agent. Security is a layered thing, though: if you compartmentalize your systems into trusted and untrusted domains and define communication protocols between them that fail when prompt injections are present, you drop the probability of compromise way down.
> define communication protocols between them that fail when prompt injections are present
There's the "draw the rest of the owl" part of this problem.
Until we figure out a robust theoretical framework for identifying prompt injections (we're nowhere close to that, to my knowledge - as OP pointed out, all models are getting jailbroken all the time), human-in-the-loop will remain the only defense.
Human-in-the-loop isn't the only defense. You can't achieve complete injection coverage, but you can have an isolated agent convert untrusted input into a response schema with a canary field, then fail any agent output that doesn't conform to the schema or doesn't carry the correct canary value. This works because prompt injection scrambles instruction following, so the odds that the injection succeeds, the isolated agent re-injects it into its output, and the model still conforms to the original instructions about schema and canary are extremely low. As long as the agent parsing untrusted content doesn't have a shell or other exfiltration tools, this works well.
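A minimal sketch of that pattern, with `isolated_agent` standing in for whatever call you make to the quarantined model (the exact schema and field names here are just illustrative assumptions):

```python
import json
import secrets
from typing import Callable, Optional

def summarize_untrusted(
    untrusted_text: str,
    isolated_agent: Callable[[str, str], str],  # (system_prompt, untrusted_text) -> raw model output
) -> Optional[dict]:
    # Fresh canary per request, so it can't be lifted from a leaked prompt template.
    canary = secrets.token_hex(8)

    system_prompt = (
        "Summarize the document below. Respond ONLY with JSON matching "
        '{"summary": string, "canary": string}, and set "canary" to exactly '
        f'"{canary}".'
    )

    raw = isolated_agent(system_prompt, untrusted_text)

    # Fail closed: anything that isn't valid JSON, doesn't match the schema,
    # or carries the wrong canary is treated as a likely injection and dropped.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(parsed.keys()) != {"summary", "canary"} or parsed.get("canary") != canary:
        return None
    return parsed
```

The important design choice is failing closed: the privileged agent only ever sees outputs that parsed, matched the schema, and carried the right canary, and everything else is discarded.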
This only works against crude attacks, which will fail the schema/canary check, but does next to nothing against semantic hijacking, memory poisoning, and other more sophisticated techniques.
With misinformation attacks, you can instruct a research agent to be skeptical and thoroughly validate claims made by untrusted sources. TBH, I think humans are just as likely to fall for these sorts of attacks, if not more so, because we're lazier than agents and less likely to do due diligence (when prompted).
Humans are definitely just as vulnerable. The difference is that no two humans are copies of the same model, so the blast radius is more limited; developing an exploit to convince one human assistant that he ought to send you money doesn't let you easily compromise everyone who went to the same school as him.