simonw 6 days ago
> I guess it's not 100% effective, but it's something.

That's the problem: in the context of security, not being 100% effective is a failure. If the ways we prevented XSS or SQL injection attacks against our apps only worked 99% of the time, those apps would all be hacked to pieces. The job of an adversarial attacker is to find the 1% of attacks that work.

The instruction hierarchy is a great example: it doesn't solve the prompt injection class of attacks against LLM applications because it can still be subverted.
red75prime 6 days ago
Organizations face a similar problem: how to build reliable, secure processes out of fallible components (humans). The difference is that humans don't all react in the same way to the same stimulus, so you can't hack all of them with the same trick, while computers react in a predictable way.

Maybe (in the absence of long-term memory that would allow such holes to be patched quickly) it would make sense to make LLMs less predictable in their reactions to adversarial stimuli: randomly perturb the initial state several times and compare the results. Adversarial stimuli should be less robust to such perturbation, since they are artifacts of insufficient training.
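A minimal sketch of that idea, using sampling randomness (different seeds and a nonzero temperature) as a stand-in for "perturbing the initial state". `call_model` is a hypothetical placeholder for whatever LLM endpoint is in use, and the agreement threshold is an arbitrary assumption, not a tested value:

    import random
    from difflib import SequenceMatcher

    def call_model(prompt: str, seed: int, temperature: float) -> str:
        """Hypothetical wrapper around whatever LLM endpoint you use."""
        raise NotImplementedError("plug in your model call here")

    def pairwise_agreement(responses: list[str]) -> float:
        """Mean text similarity across all response pairs (0.0 to 1.0)."""
        pairs = [(a, b) for i, a in enumerate(responses) for b in responses[i + 1:]]
        if not pairs:
            return 1.0
        return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

    def answer_with_consistency_check(prompt: str, n: int = 5, threshold: float = 0.7) -> str:
        # Sample the same prompt several times under randomized conditions.
        responses = [
            call_model(prompt, seed=random.randrange(2**32), temperature=0.9)
            for _ in range(n)
        ]
        # Benign prompts should yield broadly similar answers; a brittle
        # adversarial input is more likely to work on some samples and not
        # others, dragging the agreement score down.
        if pairwise_agreement(responses) < threshold:
            return "Refused: responses were inconsistent, possible adversarial input."
        return responses[0]
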