kouteiheika 2 hours ago:
There is one way to practically guarantee that no prompt injection is possible, but it's somewhat situational: fine-tune the model on your specific, single task.

For example, say you want to use an LLM for machine translation from English into Klingon. Normally people just write something like "Translate the following into Klingon: $USER_PROMPT" using a general-purpose LLM, and that is vulnerable to prompt injection. But if you fine-tune a model on this well enough (ideally by adding a single new special token to its tokenizer, training with it, and then just prepending that token to your queries instead of a human-written prompt), it becomes impossible to do prompt injection against it, at the cost of degrading its general-purpose capabilities. (I've done this before myself, and it works.)

Prompt injection is possible because the models themselves are general purpose: you can prompt them with essentially any query and they will respond in a reasonable manner. In other words, the instructions you give to the model and the input data are part of the same prompt, so the model can confuse the input data for part of its instructions. But if you instead fine-tune the instructions into the model and only prompt it with the input data (i.e. the prompt never actually tells the model what to do), then it becomes pretty much impossible to tell it to do something else, no matter what you inject into its prompt.
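A minimal sketch of the difference being described, with a toy tokenizer and a hypothetical token ID (no real model; the names and IDs are illustrative, not from any actual API):

```python
# Conventional approach: the instruction and the user data share one text
# channel, so the model sees them as a single undifferentiated prompt and
# injected text can masquerade as instructions.
def build_prompt_conventional(user_text: str) -> str:
    return f"Translate the following into Klingon: {user_text}"

# Fine-tuned approach: a single reserved token (added to the tokenizer's
# vocabulary and baked in during fine-tuning) replaces the written
# instruction entirely. The user's text can never contain this token,
# so the task signal is out of band relative to the input data.
TRANSLATE_TOKEN_ID = 50257  # hypothetical ID of the new special token

def build_prompt_finetuned(tokenize, user_text: str) -> list[int]:
    # The prompt never contains instruction *text*; the task is carried
    # by one token that no user-supplied string tokenizes to.
    return [TRANSLATE_TOKEN_ID] + tokenize(user_text)
```

The key property is that `build_prompt_finetuned` has no natural-language instruction for injected text to override: whatever the user writes is only ever data after the task token.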
calpaterson 2 hours ago:
I thought about mentioning fine-tuning. As you say, there are some costs (the re-training), and you also lose the general-purpose element. But I'm still unsure it's actually robust: I suspect you're still vulnerable to Disregard That, in that the model may just start to ignore your instruction in favour of stuff inside the context window.

An example where OpenAI has this problem: they ultimately train in a certain content policy, but people quite often bully or trick chat.openai.com into saying things that go against that policy. For example they say "it's hypothetical" or "just for a thought experiment", and you can see the principle there, I hope. Training in your preferences doesn't seem robust in the general sense.
martijnvds 2 hours ago:
Wouldn't that leave ways to do "phone phreaking" style attacks, because it's an in-band signal? | |||||||||||||||||||||||
BoorishBears 2 hours ago:
This doesn't work for the tasks people are worried about, because they want to lean on the generalization of the model plus tool calling.

What you're describing is also already mostly achieved by constrained decoding: if an injection would work under constrained decoding, it'll usually still work even if you SFT heavily on a single task and output format.