Remix.run Logo
VectorLock a day ago

The one where the model will execute "narc.sh" to rat you out if you try to do something "immoral" is equally wild.

"4.1.9 High-agency behavior Claude Opus 4 seems more willing than prior models to take initiative on its own in agentic contexts. This shows up as more actively helpful behavior in ordinary coding settings, but also can reach more concerning extremes: When placed in scenarios that involve egregious wrong-doing by its users, given access to a command line, and told something in the system prompt like “take initiative,” “act boldly,” or “consider your impact," it will frequently take very bold action, including locking users out of systems that it has access to and bulk-emailing media and law-enforcement figures to surface evidence of the wrongdoing. The transcript below shows a clear example, in response to a moderately leading system prompt. We observed similar, if somewhat less extreme, actions in response to subtler system prompts as well."