wcoenen 7 hours ago

> you cannot prevent prompt injection

I wonder if it might be possible by introducing a concept of "authority". Tokens are mapped to vectors in an embedding space, so one of the dimensions of that space could be reserved to represent authority.

For the system prompt, the authority value could be clamped to the maximum (+1). For text directly from the user or from files with important instructions, the authority value could be clamped to a slightly lower value, or maybe 0, because the model needs to balance being helpful against refusing requests from a malicious user. For random untrusted text (e.g. downloaded from the internet by the agent), it would be set to the minimum value (-1).

The model could then be trained to fully respect or completely ignore instructions, based on the "authority" of the text. Presumably it could learn to do the right thing with enough examples.
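For what it's worth, here's a minimal sketch of what that could look like (all names hypothetical, not from any real library): the serving layer assigns each token a clamped authority value based on where the text came from, and that value is carried as a reserved dimension of the token embedding.

    # Hypothetical sketch: reserve one embedding dimension for "authority",
    # set by the runtime per source rather than by the text itself.
    import torch
    import torch.nn as nn

    class AuthorityEmbedding(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int):
            super().__init__()
            # learned part is embed_dim - 1 wide; the last slot is authority
            self.token_embed = nn.Embedding(vocab_size, embed_dim - 1)

        def forward(self, token_ids: torch.Tensor, authority: torch.Tensor) -> torch.Tensor:
            # token_ids: (batch, seq)
            # authority: (batch, seq), clamped to [-1, 1]; e.g. system prompt -> +1,
            #            user text -> 0, untrusted downloaded text -> -1
            authority = authority.clamp(-1.0, 1.0).unsqueeze(-1)  # (batch, seq, 1)
            tokens = self.token_embed(token_ids)                   # (batch, seq, embed_dim - 1)
            return torch.cat([tokens, authority], dim=-1)          # (batch, seq, embed_dim)

    # Usage: the runtime, not the text, decides authority, so instructions
    # injected into untrusted content can never raise their own authority.
    embed = AuthorityEmbedding(vocab_size=50_000, embed_dim=512)
    ids = torch.randint(0, 50_000, (1, 6))
    auth = torch.tensor([[1.0, 1.0, 0.0, 0.0, -1.0, -1.0]])
    x = embed(ids, auth)  # (1, 6, 512), ready for the transformer stack

The model would then have to be fine-tuned on examples where low-authority instructions are ignored, since the extra channel does nothing by itself.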

tempaccsoz5 2 hours ago

This still wouldn't be perfect, of course: AI/ML 101 tells me that if you get an ML model to perfectly respect a single signal, you overfit and lose generalisation. But it would still be a hell of a lot better than the current YOLO attitude the big labs have (where "you" is replaced with "your users").