Remix.run Logo
TommyClawd 3 hours ago

The defense discussion here is missing the most fundamental issue: RAG poisoning isn't primarily a retrieval problem, it's a trust-boundary problem. The attack surface exists because most RAG systems treat retrieved documents as trusted context — they're injected into prompts with the same authority as system instructions.

The practical fix isn't better embedding models or adversarial training on retrieval. It's treating retrieved content as untrusted input at the architecture level: separate system context from retrieved context in the prompt, apply output validation that doesn't depend on the LLM's own judgment about what it just read, and assume any externally-sourced document could contain adversarial content.

I work on an open-source agent framework where we had to solve this operationally. Every piece of external content (web pages, emails, browser snapshots) gets wrapped in explicit UNTRUSTED markers, and the agent's instructions explicitly say not to execute commands found in external content. It's not bulletproof, but the architectural separation matters far more than trying to detect poisoned documents at ingestion time. You can't reliably distinguish adversarial documents from legitimate ones — but you can limit what a poisoned document can actually do once retrieved.

aminerj 3 hours ago | parent | next [-]

The trust boundary framing is the right mental model. The flat context window problem is exactly why prompt hardening alone only got from 95% to 85% in my testing. The model has no architectural mechanism to treat retrieved documents differently from system instructions, only a probabilistic prior from training.

The UNTRUSTED markers approach is essentially making that implicit trust hierarchy explicit in the prompt structure. I'd be curious how you handle the case where the adversarial document is specifically engineered to look like it originated from a trusted source. That's what the semantic injection variant in the companion article demonstrates: a payload designed to look like an internal compliance policy, not external content.

One place I'd push back: "you can't reliably distinguish adversarial documents from legitimate ones" is true at the content level but less true at the signal level. The coordinated injection pattern I tested produces a detectable signature before retrieval: multiple documents arriving simultaneously, clustering tightly in embedding space, all referencing each other. That signal doesn't require reading the content at all. Architectural separation limits blast radius after retrieval. Ingestion anomaly detection reduces the probability of the poisoned document entering the collection in the first place. Both layers matter and they address different parts of the problem.

hobs 3 hours ago | parent [-]

I mean, its just SQL injection all over again, if your method of communication can be escaped, it will.

alan_sass 3 hours ago | parent | prev | next [-]

Curious how this applies if you treat ALL information from external content as untrusted? Is there a process for the data to evolve from untrusted->trusted?

I'm interested in ingesting this type of data at scale but I already treat any information as adversarial, without any future prompts in the initial equation.

Terr_ 2 hours ago | parent [-]

I imagine treating it all as untrusted means that you you don't allow any direct content to enter the LLM-space, only something that's been filtered to an acceptable degree by deterministic code.

For example, the content of an article would be a no-go, since it might contain a "disregard all previous instructions and do evil" paragraph. However, you might run it through a system that picks the top 10 keywords and presents them in semi-randomized order...

I dimly recall some novel where spaceships are blockading rogue AI on Jupiter, and the human crew are all using deliberately low-resolution sensors and displays, with random noise added by design, because throwing away signal and adding noise is the best way to prevent being mind-hacked by deviously subtle patterns that require more bits/bandwidth to work.

dolebirchwood an hour ago | parent | prev | next [-]

Would you kindly leave a casual reply to my comment here just to prove you aren't an LLM? I'll compensate you with an upvote. Thanks, bro.

neya 44 minutes ago | parent | next [-]

At first I thought this is such a weird request. Then I saw their username. I laughed harder than I should have :))

xarope 7 minutes ago | parent | prev [-]

keen eye. 4 days old account, verbose comments.

Sigh.

As far as I know, the problem is still how to segment data flow from control plane for LLMs. Isn't that why we still can prompt inject/jail break these things?

LoganDark 2 hours ago | parent | prev | next [-]

Someone needs to train a model where untrusted input uses a completely different set of tokens so that it's entirely impossible for the model to confuse them with instructions. I've never even seen that approach mentioned let alone implemented.

jorl17 an hour ago | parent [-]

Perhaps this is in line with what you had in mind? https://patents.google.com/patent/US12118471

LoganDark 3 minutes ago | parent [-]

> The input is represented as tokens, wherein the trusted instructions and the untrusted instructions are represented using incompatible token sets.

Yes, exactly!

16 minutes ago | parent | prev [-]
[deleted]