red75prime | 6 days ago
Organizations face a similar problem: how to build reliable/secure processes out of fallible components (humans). The difference is that humans don't all react in the same way to the same stimulus, so you can't hack all of them with the same trick, while computers react predictably. Maybe (in the absence of long-term memory that would allow such holes to be patched quickly) it would make sense to render LLMs less predictable in their reactions to adversarial stimuli by randomly perturbing the initial state several times and comparing the results. Adversarial stimuli should be less robust to such perturbation, since they are artifacts of insufficient training.
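A minimal sketch of that perturb-and-compare idea, assuming a hypothetical call_llm wrapper around whatever inference API is in use; the premise being tested is that adversarial inputs produce less consistent outputs under random perturbation than benign ones:

    import random
    from collections import Counter

    def call_llm(prompt: str, seed: int, temperature: float) -> str:
        """Hypothetical: return one sampled completion for the given seed/temperature."""
        raise NotImplementedError

    def consensus_answer(prompt: str, n_samples: int = 5) -> tuple[str, float]:
        """Sample the model several times under perturbed conditions and
        return the majority answer plus its agreement rate."""
        outputs = []
        for _ in range(n_samples):
            seed = random.getrandbits(32)           # perturb the sampling seed
            temperature = random.uniform(0.5, 1.0)  # perturb the initial conditions
            outputs.append(call_llm(prompt, seed, temperature))
        answer, count = Counter(outputs).most_common(1)[0]
        return answer, count / n_samples

Low agreement across the perturbed runs could then be treated as a signal that the input is adversarial (or simply ambiguous) and routed to a stricter fallback.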
simonw | 6 days ago
LLMs are already unpredictable in their responses, which adds to the problem: you might test your system against a potential prompt injection three times and observe it resist the attack, while an attacker might try another hundred times and have one of their attempts work.
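A quick back-of-the-envelope calculation of that asymmetry, with an assumed (purely illustrative) per-attempt injection success rate of 1%:

    # Probability of seeing at least one successful injection in n independent attempts.
    p = 0.01
    defender_trials = 3
    attacker_trials = 100
    print(1 - (1 - p) ** defender_trials)   # ~0.03: the defender's test likely shows it resisting
    print(1 - (1 - p) ** attacker_trials)   # ~0.63: the attacker likely gets one attempt through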