Remix.run Logo
krackers 4 hours ago

LLMs already do this and have a system role token. As I understand in the past this was mostly just used to set up the format of the conversation for instruction tuning, but now during SFT+RL they probably also try to enforce that the model learns to prioritize system prompt against user prompts to defend against jailbreaks/injections. It's not perfect though, given that the separation between the two is just what the model learns while the attention mechanism fundamentally doesn't see any difference. And models are also trained to be helpful, so with user prompts crafted just right you can "convince" the model it's worth ignoring the system prompt.

marcus_holmes 3 hours ago | parent | next [-]

Thanks that's useful.

So it's still one stream of tokens as far as the LLM is concerned, but there is some emphasis in training on "trust the system prompt", have I got that right?

veganmosfet 3 hours ago | parent | prev [-]

This! And even more, the role model extends beyond system and user: system > user > tool > assistant. This reflects "authority" and is one of the best "countermeasure": never inject untrusted content in "user" messages, always use "tool".