prithvi2206 5 hours ago
A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior", e.g. the guidance against behavior like "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic".
Imnimo 5 hours ago
In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are separate and distinct. Otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.