simonw | 5 days ago
Specifying the message role should be considered a suggestion, not a hardened rule. I've not seen a single example of an LLM that can reliably follow its system prompt against all forms of potential trickery in the non-system prompt. Solve that and you've pretty much solved prompt injection! | ||||||||
koakuma-chan | 5 days ago | parent
> The lack of a 100% guarantee is entirely the problem.

I agree, and I agree that when using models there should always be the assumption that the model can use its tools in arbitrary ways.

> Solve that and you've pretty much solved prompt injection!

But do you think this can be solved at all? For an attacker who can send arbitrary inputs to a model, getting the model to produce the desired output (e.g. a malicious tool call) is just a matter of finding the right input.

Edit: how about limiting the rate at which inputs can be tried, and/or using an LLM-as-a-judge to assess the legitimacy of important tool calls (see the sketch below)? You could probably also harden the model by fine-tuning it to reject malicious prompts; model developers likely already do that.
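A minimal sketch of the LLM-as-a-judge idea, assuming a separate `judge` callable that wraps whatever second model/API you use, and some hypothetical tool names; it only gates calls marked as sensitive and denies by default:

```python
# Sketch: gate sensitive tool calls behind a second-model review.
# `judge` is any callable that sends a prompt to a separate model and
# returns its text response; the tool names below are hypothetical.
import json
from typing import Callable

SENSITIVE_TOOLS = {"send_email", "delete_file", "transfer_funds"}

def approve_tool_call(tool_name: str, arguments: dict, context: str,
                      judge: Callable[[str], str]) -> bool:
    """Ask a second model whether a proposed tool call looks legitimate."""
    if tool_name not in SENSITIVE_TOOLS:
        return True  # only gate the important calls

    prompt = (
        "You are a security reviewer. Given the conversation context and a "
        "proposed tool call, answer ONLY 'ALLOW' or 'DENY'.\n\n"
        f"Context:\n{context}\n\n"
        f"Proposed call: {tool_name}({json.dumps(arguments)})"
    )
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("ALLOW")  # anything else is treated as a denial

# Example with a stubbed judge; in practice `judge` would call a second model.
ok = approve_tool_call("send_email", {"to": "someone@example.com"},
                       "User asked for a summary of a web page",
                       judge=lambda p: "DENY")
print(ok)  # False
```

Of course the judge is itself an LLM reading attacker-influenced context, so per the parent's point this raises the cost of an attack rather than guaranteeing anything.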