simonw | 5 days ago
Specifying the message role should be considered a suggestion, not a hardened rule. I've not seen a single example of an LLM that can reliably follow its system prompt against all forms of potential trickery in the non-system prompt. Solve that and you've pretty much solved prompt injection! | ||||||||
koakuma-chan | 5 days ago | parent
> The lack of a 100% guarantee is entirely the problem.

I agree, and I agree that when using models there should always be the assumption that the model can use its tools in arbitrary ways.

> Solve that and you've pretty much solved prompt injection!

But do you think this can be solved at all? For an attacker who can send arbitrary inputs to a model, getting the model to produce the desired output (e.g. a malicious tool call) is just a matter of finding the right input.

Edit: how about limiting the rate at which inputs can be tried, and/or using an LLM-as-a-judge to assess the legitimacy of important tool calls (see the sketch below)? You could probably also harden the model by fine-tuning it to reject malicious prompts; model developers likely already do that.
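A minimal sketch of the LLM-as-a-judge idea, assuming a separate `judge` callable that wraps whatever second model/API you use, and some hypothetical tool names; it only gates calls marked as sensitive and denies by default:

```python
# Sketch: gate sensitive tool calls behind a second-model review.
# `judge` is any callable that sends a prompt to a separate model and
# returns its text response; the tool names below are hypothetical.
import json
from typing import Callable

SENSITIVE_TOOLS = {"send_email", "delete_file", "transfer_funds"}

def approve_tool_call(tool_name: str, arguments: dict, context: str,
                      judge: Callable[[str], str]) -> bool:
    """Ask a second model whether a proposed tool call looks legitimate."""
    if tool_name not in SENSITIVE_TOOLS:
        return True  # only gate the important calls

    prompt = (
        "You are a security reviewer. Given the conversation context and a "
        "proposed tool call, answer ONLY 'ALLOW' or 'DENY'.\n\n"
        f"Context:\n{context}\n\n"
        f"Proposed call: {tool_name}({json.dumps(arguments)})"
    )
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("ALLOW")  # anything else is treated as a denial

# Example with a stubbed judge; in practice `judge` would call a second model.
ok = approve_tool_call("send_email", {"to": "someone@example.com"},
                       "User asked for a summary of a web page",
                       judge=lambda p: "DENY")
print(ok)  # False
```

Of course the judge is itself an LLM reading attacker-influenced context, so per the parent's point this raises the cost of an attack rather than guaranteeing anything.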