Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

So essentially, constructing an LLM that really really really really really knows the difference between the SYSTEM and the USER part of the instructions.

How is that different from, and why would it work any better, than prompt-begging, where people just write extensive system prompts, telling the model what it can and should do and then spending entire paragraphs pleading with the model to not do the wrong thing?

https://www.theregister.com/2023/04/26/simon_willison_prompt...

    A third mitigation strategy, he said, involves just begging the model not to deviate from its system instructions. "I find those very amusing," he said, "when you see these examples of these prompts, where it's like one sentence of what it's actually supposed to do, and then paragraphs pleading with the model not to allow the user to do anything else."
I see no difference between that, and baking it into the model. In the end, I'd still have to trust the LLM to do what I intend for it to do, based on the sequences it sees, and the user still controls part of that sequence. There is no guarantee that there isn't a sequence that would allow the user-prompt to break out of the invisible metatags. In fact, one could employ an AI to find just such a sequence.

Maybe the system works better than prompt-begging, but show of hands, who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?



> who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

Well, I mean in practice people deploy web apps all the time even though they have a long history of many types of injection attacks including SQL injection which is by far not a solved problem. And even very large companies often rely on heuristic defenses like WAFs. So I think that yes people will be willing to deploy these systems even if they aren't perfect. They already are! After all, in many use cases, overriding the prompt doesn't get you very far because it just means the output won't be parsed correctly by whatever system is driving the LLM API.


All that is true, but also besides the point.

The point is, that since we cannot use any kind of known finetuning to _eliminate_ even this obvious security problem (making it somewhat less likely is not a solution), in my opinion fine tuning is not markedly improving the AIs capabilities in the sense of "improvement" that AI doomsday scenarios would require.


I agree that fine-tuning isn't going to lead to any kind of recursive self improvement. Current evidence is that it makes AIs dumber at the same time as making them more compliant, i.e. it's actually quite the opposite.

So you may be right, but for the specific case of stopping prompt injection I'm optimistic. RL has proven to be highly effective at making LLMs behave in particular ways with relatively little data. The combination of special tokens and duelling LLMs is likely to eliminate the issue in the relatively near term (within the next few years if not sooner).

Fundamentally, are humans vulnerable to prompt injection? No, we're not. We might be in a very artificial case like what LLM input looks like, where there are multiple people speaking to us simultaneously via a chat app and the boundaries between them aren't clearly marked. But that's a UI issue - proper presentation and separation would eliminate the problem for humans, and I think the same will be true for LLMs.

Note that even if I'm right (and I'm no expert, the above is layman speculation), then this still leaves analogous problems in the field of computer vision with adversarial examples.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: