btown 5 hours ago
It's really vital to also point out that (C) doesn't just mean agentically communicating externally - it extends to any situation where any of your users can even access the output of a chat or other generated text.

You might say "well, I'm running the output through a watchdog LLM before displaying it to the user, and that watchdog doesn't have private data access and checks for anything nefarious." But the moment someone figures out how to prompt-inject a quine-like payload into the private-data-accessing system, such that its output is itself another prompt injection, you've got both (A) and (B) in your system as a whole.

Depending on your problem domain, you can mitigate this: if you're doing a classification problem and validate your outputs that way (see the sketch below), there's not much opportunity for exfiltration (though perhaps some might see that as a challenge). But plaintext outputs are difficult to guard against.
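A minimal sketch of what "validate your outputs that way" can look like: constrain the model's output to a fixed whitelist of labels, so nothing but one of those labels ever reaches the user. The names here (ALLOWED_LABELS, call_llm, classify_ticket) are illustrative, not from any particular library, and call_llm is a hypothetical stand-in for whatever model client you actually use.

    # Sketch: treat the LLM as a classifier and whitelist its output.
    ALLOWED_LABELS = {"refund", "shipping", "account", "other"}

    def call_llm(prompt: str) -> str:
        # Hypothetical placeholder - plug in your real model client here.
        raise NotImplementedError

    def classify_ticket(ticket_text: str) -> str:
        raw = call_llm(
            "Classify this support ticket as one of: "
            + ", ".join(sorted(ALLOWED_LABELS))
            + "\n\n" + ticket_text
        )
        label = raw.strip().lower()
        # The narrow output channel is the whole point: only one of a fixed
        # set of labels is ever shown, so injected or exfiltrated text has
        # nowhere to ride along.
        if label not in ALLOWED_LABELS:
            return "other"  # or reject and log; never echo the raw output
        return label

Note this only works because the output space is tiny and enumerable; the moment you let free-form text through, you're back to the plaintext problem above.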
quuxplusone 4 hours ago | parent
Can you elaborate? How does an attacker turn "any of your users can even access the output of a chat or other generated text" into a means of exfiltrating data to themselves? Are you just worried about social engineering - that is, if the attacker can make the LLM say "to complete registration, please paste the following hex code into evil.example.com", then a large number of human users will simply do that? I mean, you'd probably be right, but if that's "all" you mean, it'd be helpful to say so explicitly.