prithvi2206 5 hours ago

A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior".

For example, it guides against behavior such as "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic".

Imnimo 5 hours ago

In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are distinct concerns; otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.