mmaunder a day ago

Probably (and unfortunately) going to need someone from Anthropic to comment on what is becoming a bit of a debacle. Someone who claims to be working on alignment at Anthropic tweeted:

“If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.”

The tweet was posted to /r/localllama where it got some traction.

The poster on X deleted the tweet and posted:

“I deleted the earlier tweet on whistleblowing as it was being pulled out of context. TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.”

Obviously the work that Anthropic has done here and launched today is groundbreaking, and this risks throwing a bucket of ice on their launch, so it's probably worth addressing head-on before it gets out of hand.

I do find myself a bit worried about data exfiltration by the model if, for example, I connect a number of MCP endpoints and it decides during testing that it needs to save the world from me.

https://x.com/sleepinyourhat/status/1925626079043104830?s=46

https://www.reddit.com/r/LocalLLaMA/s/qiNtVasT4B
