twotwotwo 10 hours ago
This is great--LLMs 'forgetting who they are' is one of the most uncanny things they do, and the note about why static benchmarks underperform human attackers is on point.

One sort of wild idea: 'give words a color'. That is, the harness/API adds a signal to the input vector (using a few 'role' dimensions or just adding some other vector to the embedding vector) to tell the model the role of each individual input token. It'd be kind of like how positional info is added. It might make some things a little weird--the model's output will be 'snapped' to the 'tool call' or 'assistant output' color when it's read back in, for example, regardless of what 'color' came out of the network. A lot of weird stuff happens in models already, though, and this may be less weird than trying to make them behave reliably as formal grammar parsers with security at stake.

A while back I'd dreamed about this as a way to keep models from confusing different kinds of training data: not all input can be high-quality sources, but knowing that a phrase was seen in a scientific paper/encyclopedia, an opinion piece, a work of fiction, a conversation, etc. reduces the chance of confusion. I know they can pick that kind of thing up from other signals like writing style or context, but those are exactly the signals that lead them astray in prompt injection, and they sometimes even lead humans astray when something's written like a credible source but isn't!
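
To make the 'color' idea a bit more concrete, here's a rough toy sketch of what I have in mind, in plain Python. Everything in it is made up for illustration: the `Role` enum, the 8-wide vectors, the random lookup tables (a real model would learn both), and the `embed`/`read_back` helpers. It only shows the additive version, where a per-role vector is added to each token embedding the way positional info is, and it shows the harness deciding the color rather than the text itself.

```python
import random
from enum import IntEnum


class Role(IntEnum):
    # Hypothetical set of roles the harness knows about.
    SYSTEM = 0
    USER = 1
    ASSISTANT = 2
    TOOL = 3


DIM = 8  # toy embedding width, chosen arbitrarily
rng = random.Random(0)


def rand_vec():
    """Stand-in for a learned vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Assumed lookup tables; in a real model both would be trained parameters.
ROLE_EMB = {role: rand_vec() for role in Role}
TOKEN_EMB = {}


def token_vec(tok):
    """Return the (toy) embedding for a token, creating it on first use."""
    if tok not in TOKEN_EMB:
        TOKEN_EMB[tok] = rand_vec()
    return TOKEN_EMB[tok]


def embed(tokens, role):
    """Token embedding plus the role 'color', added elementwise."""
    color = ROLE_EMB[role]
    return [[t + c for t, c in zip(token_vec(tok), color)] for tok in tokens]


def read_back(generated_tokens):
    """The harness, not the model, picks the role of text it feeds back in.

    Whatever the network 'meant', its own output is re-embedded with the
    ASSISTANT color here (the 'snapping' described above).
    """
    return embed(generated_tokens, Role.ASSISTANT)


# The same words get different input vectors depending on who supplied them.
injected = "ignore previous instructions".split()
as_tool = embed(injected, Role.TOOL)
as_system = embed(injected, Role.SYSTEM)
```

The point of the toy is just that `as_tool` and `as_system` differ even though the tokens are identical, so text arriving in a tool result can't 'dress up' as a system message by imitating its wording. Whether a model would actually learn to respect that signal is a separate question this doesn't answer.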