datsci_est_2015 | an hour ago
How is it feasible to create sufficiently encompassing metrics when the attack surface is the entire automaton's interface with the outside world? If you insist on the lock analogy: most locks are easily defeated, and the usual wisdom is to spend roughly as much on the lock as on the thing you're protecting (at least with, e.g., bikes). Other locks are only meant to slow attackers down while something is being monitored (e.g., storage lockers). Others are simply a social contract. I don't think any of those considerations map neatly onto the "LLM divulges secrets when prompted" space. The better analogy might be the cryptography that ensures your virtual private server can only be accessed by you.

Edit: the reason "firing" matters is that humans behave more cautiously when there are serious consequences. Call me up when LLMs can act more cautiously because they know they're about to be turned off, and maybe when they have the urge to procreate.