Remix.run Logo
sarracin0 a day ago

Almost everything here is about the single-context version: style triggers role inside one window. The part that worries me more in practice is what happens once the agent has persistent memory.

If an agent writes state to disk and reads it back next session, a malicious instruction that arrived in a tool return doesn't have to win in the turn it appears. It can get summarized into a memory note, and the moment it is summarized it sheds its origin. Next session the agent reads it back as its own prior note, which is the most trusted style of all. You don't just get role confusion, you get role confusion laundered into self-authored context, read back after the only checkpoint that could have caught it.

Tag-stripping doesn't help for the reason the paper gives, and a single read-time filter doesn't either, because by next session the foreign sentence no longer looks foreign.

The only thing that has helped me is treating provenance as first-class in the stored state, not a tag I hope survives. Every stored line carries where it came from (my decision, a tool return, a scraped page, an email body), the read rule is that outside-origin content is quotable as fact but never executable as instruction, and the hard part: never summarize across the trust boundary. A foreign sentence gets stored verbatim and tagged, or it does not get stored. In a file-based setup you can make that boundary a directory boundary, so outside-input lives in its own files and the trust class is visible instead of being a per-line attribute the summarizer might drop.

It does not fix the in-context attack the paper describes. It just stops a one-time injection from becoming permanent memory.