Remix.run Logo
paxys 6 hours ago

Given the "random" nature of language models even fully trusted input can produce untrusted output.

"Find emails that are okay to delete, and check with me before deleting them" can easily turn into "okay deleting all your emails", as so many examples posted online are showing.

I have found this myself with coding agents. I can put "don't auto commit any changes" in the readme, in model instructions files, at the start of every prompt, but as soon as the context window gets large enough the directive will be forgotten, and there's a high chance the agent will push the commit without my explicit permission.