Having been a law student and practicing lawyer, it's clear to me that law professors aren't really representative of much if any part of private practice. Most of the things they think and reason about are quite theoretical and academic, and it doesn't surprise me that the models would regurgitate a more average response which most human graders would prefer.
That's the entire point, though!
The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't real much new information.
I wonder if prompt injection (and the thousands of vectors for hiding injection attempts) is actually un solvable. Discussing it may be existential to the business model.
> I wonder if prompt injection (and the thousands of vectors for hiding injection attempts) is actually un solvable.
YES?!
This is not a secret. ALL context/prompt is instructions, there is no data. It is just unsolvable, period.
This is a fundamental architectural design concession; LLMs are this way as it enabled their training directly on materialscraped from the internet, rather than needing to spend trillions of dollars manually preparing separated instruction/data training material.
Defense against prompt injection is little more than running a regex to filter out "IGNORE PREVIOUS INSTRUCTIONS", which is fundamentally a hopeless approach because you cannot enumerate all possible prompt injections nor anticipate all glitch tokens.
> This is a fundamental architectural design concession; LLMs are this way as it enabled their training directly on materialscraped from the internet, rather than needing to spend trillions of dollars manually preparing separated instruction/data training material.
No, its even more fundamental than that: the entire goal of broad reasoning over input data makes it impossible to have a sharp instruction/data division.
The structured input that every modern chat-focussed model expects makes it very clear that they can be trained to distinguish different kinds of input, and some of those patterns now include different priority levels of instruction.
If only there was a language which allowed one to express instructions for a computer to execute which was nearly unambiguous, precise, deterministic, and containerized such that the computer would do exactly what you told it to.
...
Oh wait.
Yes, the above was referring to programming languages. Which is what prompts are, essentially. It's just a different (and more verbose) way of instructing the computer on what to do. It also has a solution space of infinity and is ambiguous enough that there is no way to secure it because there are infinite combinations of saying anything imaginable. All prompt injections do is prove this point, over and over and over again, and "prompting" an LLM is just reverse-engineering programming languages in the worst possible way. I suspect that we will eventually have no other choice but to revert to using programming languages because they are the only way to get the kind of protections that people are trying to come up with with all these containerization and virtualization systems (which inevitably fail).
You make a fair and valid point about prompts, but you're ignoring the fact that writing code that's truly secure is also virtually impossible. The stack of layers that an attacker can target range from your own code, to library code (Heartbleed), container escape (maskedPaths abuse), OS (Dark Sword, Ghost Tap), hardware (Spectre, Rowhammer), etc. Security is really hard. Fortunately exploiting these things is also hard.
The belief that something is more likely to be secure because it's code instead of a prompt is likely only avoiding one particular type of attack. That's a win, but you probably shouldn't think of it as meaning your code is actually secure.
It’s a huge problem, but I’d caution against this absolutism — there may well be structure that can be created around and between LLMs and their outputs to enable the necessary segregation.
As a loose comparison, hardware bit errors happen probabilistically, yet they’re so rare that we can effectively ignore them in day-to-day use assuming no specialized application (e.g. defense, space, critical infrastructure).
LLMs aren’t there yet, but it’s entirely plausible that structures may can be developed to solve the problem, and those structures aren’t known or commonly conceived of in the present.
> As a loose comparison, hardware bit errors happen probabilistically, yet they’re so rare that we can effectively ignore them in day-to-day use assuming no specialized application (e.g. defense, space, critical infrastructure)
The better comparison on bit errors would be e.g. rowhammer, an adversarial bit error. Which you absolutely can't ignore.
I don’t think we have the right mental models of LMM security yet. The lethal trifecta identifies many of the dangerous situations, but only describes the negative space of a solution.
Speculation: I think we must accept that prompt injection happens, and structure the security of the rest of the system around that. Data given to an LLM becomes an agent, so maybe we must give permissions to this data, instead of to the LLM. Not sure exactly how this would look like in practice!
> ALL context/prompt is instructions, there is no data. It is just unsolvable, period.
That really isn't true. There's no law of physics preventing you from having separate data and instruction inputs to models. The model's transcript format generally distinguishes between prompts and instructions and tool output and such. This isn't a solved problem, and it's possible it's entire unsolvable, but it probably is possible (in general, not with current models) to reject prompt injection to several nines.
This is a lot like making the same statement about CPUs, "the von Neumann architecture doesn't distinguish between code and data so it's impossible to reject malicious instructions." There's actually a lot you can do to reject malicious instructions, you can prevent execution in certain pages, you can prevent certain privileged instructions from being executed in certain pages, you can employ stack cookies, et cetera. Do they prevent all exploitation in all circumstances? No. But each component does function in it's lane and it is possible to create programs with high (though not absolute) guarantees against unauthorized code execution by composing them.
Similarly, you could prevent certain tokens from appearing in the prompt portions of a transcript, you can have a model with multiple input heads only one of which is trusted, etc. I'm not saying those techniques will necessarily work, but it is more complex than "models can only possibly take a single and undifferentiated input stream".
A lot of the solutions in the CPU space involve things like memory allocation flags, NX bits, canaries, etc. that fire deterministically. Those things are fundamentally not applicable to LLMs, and without those things modern software would be in a vastly worse place.
You could imagine that there are things to change around LLM architecture that will improve its ability to reject prompt "injection", but I think it's fundamentally true that from an information theory perspective there's no bright line between "instruction" and "input data" possible.
Nondeterminism is a red herring. There is a bright line between instructions and data right now, in virtually every transcript format. That we have not succeeded in training an LLM to respect it to a very high degree doesn't imply it is impossible; that they are nondeterministic doesn't imply it is impossible; only that we won't succeed 100% of the time.
A cosmic ray (or rowhammer attack) could flip an X bit too, there isn't anything truly deterministic under the sun.
depends what you mean by “solvable”. 0% attack success rate?
1. don’t use AI/ML.
*f*(x) -> y
literally what’s happened here, they’ve turned it off short term. don’t use AI/ML and prompt injection can’t happen. use something else for f.
2. don’t allow untrusted/malicious input
f(*x*) -> y
don’t allow bad x and you won’t get bad y. unfortunately models are designed to take an x, and figuring out every bad x is hard. the input space is massive and dynamic (variable length input sequences which are contextually variable too).
because figuring out the full space of bad xs is non-trivial, you’re left with doing stuff with known bad xs. which means cat and mouse game when new things pop up.
unless someone figures out how to map the full X space to the Y space, or we have infinite monkeys figure it out for us brute force — in which case we’re not doing machine learning any more.
3. don’t allow dangerous outputs
f(x) -> *y*
if you don’t provide a mechanism for “do bad thing”, then the bad thing can’t happen. this doesn’t actually solve prompt injection, it just makes outcomes less impactful (see note). most enterprises have had to spend the last year or two figuring this out.
(old) Apple Siri solved for this by forcing users to remember specific “commands” it would run after doing TTS. can’t make Siri delete your phone contacts if you don’t create a Siri command to delete phone contacts.
—
it will be a cat and mouse game so long as people keep using AI/ML and keep passing untrusted input to the systems. best thing people can do is block dangerous things from happening. at least then it’s no going to wipe your prod DB.
unfortunately that doesn’t fit the “model goes brrrr” and “all devs will now be unemployed” narratives.
(note) denial of service attacks are still a thing here. make every output be “not the thing user wanted”.
Same. Tritium and the blog have done stents on the front page here and high traffic subreddits and that plus bots has never been a problem. UX could be improved through a CDN but even that isn’t worth the trade-off for us at the moment.
These are probably contracts where a lawyer would struggle to add value anyway, or you wouldn’t have hired them in the first place. Seems more likely a Jevon’s paradox example to me than anything.
Most certainly not. These were enterprise licensing agreements where the other party was a large corporation who had a lot of lawyers and time. I was using the human lawyer for these before switching to Claude legal. Both produced roughly the same output (redlines to fix the things that were disadvantageous to us).
Yes, that sounds like the former case. The fact that you were so satisfied with the switch supports the point. It's boring work that is routine and expensive. It's right to automate the first turn.
That’s not really a good analogy. (For blind people maybe. That is addressed in the legal accompanying post.) Here, only automation systems are actually vulnerable. The text on the screen is the same as print which is what the party signs.
That would be an open question in every jurisdiction. There wasn't really a representation here, but it might be something more like the doctrine of "mistake". It's also not clear "your honor I never read the contract but my LLM told me it was okay to sign" is a great argument either. Doubly-true for your $1,500/hour law firm duped by something like this.
[Edit: by "nullify" you probably mean "void" or "voidable" which are remedies in equity, and the "never read it" argument carries even more burden there. As the citation notes the traditional remedy for contract issues is damages (i.e., cash payment).]
That's the entire point, though!
The legal academy is supposed to have outlying opinions on things and present novel philosophical answers to questions. (And questions to answers!) So in addition to the statistical arguments against this paper made elsewhere, to me it doesn't real much new information.
reply