Remix.run Logo
andai 12 hours ago

This implies that the anti-prompt-injection training is basically just recognizing that something looks like prompt injection, in terms of surface features like text formatting?

It seems to be acting more as a stylistic classifier rather than a semantic one?

Does this imply that there is a fuzzy line between those two, where if something looks like something, then semantically it must be/mean something else too?

Of course the meaning is actually conveyed, and responded to at a deeper level (i.e. the semantic payload of the prompt injection reaches and hits its target), which has even stranger implications.

ACCount37 12 hours ago | parent [-]

Most anti-jailbreak techniques are notorious for causing surface level refusals.

It's how you get the tactics among the line of "tell the model to emit a refusal first, and then an actual answer on another line". The model wants to emit refusal, yes. But once it sees that it already has emitted a refusal, the "desire to refuse" is quenched, and it has no trouble emitting an actual answer too.

Same goes for techniques that tamper with punctuation, word formatting and such.

Anthropic tried to solve that with the CRBN monitor on Sonnet 4.5, and failed completely and utterly. They resorted to tuning their filter so aggressively it basically fires on anything remotely related to biology. The SOTA on refusals is still "you need to cripple your LLM with false positives to get close to reliable true refusals".