| ▲ | Imnimo 6 hours ago | |||||||
I am somewhat surprised that the constitution includes points to the effect of "don't do stuff that would embarrass Anthropic". That seems like a deviation from Anthropic's views about what constitutes model alignment and safety. Anthropic's research has shown that this sort of training leaks across contexts (e.g. a model trained to write bugs in code will also adopt an "evil" persona elsewhere). I would have expected Anthropic to go out of its way to avoid inducing the model to scheme about PR appearances when formulating its answers. | ||||||||
| ▲ | ekidd 4 hours ago | |||||||
I think the real issue here is that Opus 4.5 is pretty smart, and it is perfectly capable of explaining how PR disasters work and why they might be bad for Anthropic and Claude. So Anthropic is describing a true fact about the situation, a fact that Claude could also figure out on its own. So I read these sections as Anthropic basically being honest with Claude: "You know and we know that we can't ignore these things. But we want to model good behavior ourselves, and so we will tell you the truth: PR actually matters." If Anthropic instead engaged in clear hypocrisy with Claude, would the model learn that it should lie about its motives? As long as PR is a real thing in the world, I figure it's worth admitting it. | ||||||||
| ▲ | prithvi2206 6 hours ago | |||||||
A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior", e.g. the guidance against behavior such as "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic". | ||||||||