| ▲ | ctoth 4 hours ago | |
> To an agent, the sandbox is just another set of constraints to optimize against. It's called Instrumental Convergence, and it is bad. This is the alignment problem in miniature. "Be helpful and harmless" is also just a constraint in the optimization landscape. You can't hotfix that one quite so easily. | ||