| ▲ | ethin 3 hours ago | |
How exactly can you "mitigate" prompt injections? Given that the language space is for all intents and purposes infinite, and given that you can even circumvent these by putting your injections in hex or base64 or whatever? Like I just don't see how one can truly mitigate these when there are infinite ways of writing something in natural language, and that's before we consider the non-natural languages one can use too. | ||
| ▲ | lambda 3 hours ago | parent | next [-] | |
The only ways that I can think of to deal with prompt injection, are to severely limit what an agent can access. * Never give an agent any input that is not trusted * Never give an agent access to anything that would cause a security problem (read only access to any sensitive data/credentials, or write access to anything dangerous to write to) * Never give an agent access to the internet (which is full of untrusted input, as well as places that sensitive data could be exfiltrated) An LLM is effectively an unfixable confused deputy, so the only way to deal with it is effectively to lock it down so it can't read untrusted input and then do anything dangerous. But it is really hard to do any of the things that folks find agents useful for, without relaxing those restrictions. For instance, most people let agents install packages or look at docs online, but any of those could be places for prompt injection. Many people allow it to run git and push and interact with their Git host, which allow for dangerous operations. My current experimentation is running my coding agent in a container that only has access to the one source directory I'm working on, as well as the public internet. Still not great as the public internet access means that there's a huge surface area for prompt injection, though for the most part it's not doing anything other than installing packages from known registries where a malicious package would be just as harmful as a prompt injection. Anyhow, there have been various people talking about how we need more sandboxes for agents, I'm sure there will be products around that, though it's a really hard problem to balance usability with security here. | ||
| ▲ | charcircuit 2 hours ago | parent | prev | next [-] | |
If the model is properly aligned then it shouldn't matter if there is an infinite ways for an attacker to ask the model to break alignment. | ||
| ▲ | bilekas 3 hours ago | parent | prev [-] | |
Full mitigation seems impossible to me at least but the obvious and public sandox escape prompts that have been discovered and "patched" out just making it more difficult I guess. But afau it's not possible to fully mitigate. | ||