▲ | wmorgan 4 days ago | |
Untrusted userspace is exactly right. I’d expect these approaches to help on the margin but the authors oversell their point using words like “guarantee.” Control tool access like OSes enforce file permissions: I understand it’s a metaphor, but also isn’t the track record of OSes here pretty bad? Check whether the agent is allowed to use the booking tool: so a web browser? Isn’t a browser a pretty powerful general-purpose tool, which by the way could also expose the agent to, like, a jailbreak? > As such, security researchers have to devise new mitigations to prevent AI models taking adversarial actions even with the virtual machine constraints. An understated reminder that yes, we really ought to solve alignment. |