So essentially, constructing an LLM that really really really really really know...

mike_hearn · on May 29, 2023

> who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

Well, I mean in practice people deploy web apps all the time even though they have a long history of many types of injection attacks including SQL injection which is by far not a solved problem. And even very large companies often rely on heuristic defenses like WAFs. So I think that yes people will be willing to deploy these systems even if they aren't perfect. They already are! After all, in many use cases, overriding the prompt doesn't get you very far because it just means the output won't be parsed correctly by whatever system is driving the LLM API.

usrbinbash · on May 29, 2023

All that is true, but also besides the point.

The point is, that since we cannot use any kind of known finetuning to _eliminate_ even this obvious security problem (making it somewhat less likely is not a solution), in my opinion fine tuning is not markedly improving the AIs capabilities in the sense of "improvement" that AI doomsday scenarios would require.

mike_hearn · on May 31, 2023

I agree that fine-tuning isn't going to lead to any kind of recursive self improvement. Current evidence is that it makes AIs dumber at the same time as making them more compliant, i.e. it's actually quite the opposite.

So you may be right, but for the specific case of stopping prompt injection I'm optimistic. RL has proven to be highly effective at making LLMs behave in particular ways with relatively little data. The combination of special tokens and duelling LLMs is likely to eliminate the issue in the relatively near term (within the next few years if not sooner).

Fundamentally, are humans vulnerable to prompt injection? No, we're not. We might be in a very artificial case like what LLM input looks like, where there are multiple people speaking to us simultaneously via a chat app and the boundaries between them aren't clearly marked. But that's a UI issue - proper presentation and separation would eliminate the problem for humans, and I think the same will be true for LLMs.

Note that even if I'm right (and I'm no expert, the above is layman speculation), then this still leaves analogous problems in the field of computer vision with adversarial examples.