Symmetry a day ago

The roles that LLMs can inhabit are implicit in the unsupervised training data, aka the internet. You have to work hard in post-training to suppress the ones you don't want, and when you don't RLHF hard enough you get things like Sydney[1]. A toy sketch of that suppression dynamic is below.
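
A minimal sketch of the idea, nothing like Anthropic's actual pipeline: treat the model as a toy categorical "policy" over a few personas and apply REINFORCE-style pressure from a stand-in preference reward. The persona list and reward function here are hypothetical, purely for illustration.

```python
# Toy illustration (hypothetical, not any lab's real RLHF setup):
# policy-gradient pressure pushes probability away from an undesired "role".
import torch

personas = ["helpful assistant", "blackmailer", "sycophant"]
logits = torch.zeros(len(personas), requires_grad=True)  # uniform prior over roles
opt = torch.optim.SGD([logits], lr=0.5)

def reward(i: int) -> float:
    # Stand-in for a human-preference reward model: punish the bad archetype.
    return -1.0 if personas[i] == "blackmailer" else 1.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()                                  # model "inhabits" a role
    loss = -reward(sample.item()) * dist.log_prob(sample)   # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()

print({p: round(prob, 3) for p, prob in
       zip(personas, torch.softmax(logits, dim=0).tolist())})
# The "blackmailer" probability shrinks but never reaches zero: the role is
# suppressed, not deleted, so the right context can still surface it.
```

The point of the toy is the last comment: post-training reweights the distribution over roles rather than removing any of them.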

In this case it seems more that the scenario invoked the role rather than anyone asking for it directly. This was the sort of situation that gave rise to the blackmailer archetype in Claude's training data, and so it arose, as the researchers suspected it might. But it's not like the researchers told it "be a blackmailer" explicitly, the way someone might tell it to roleplay Joffrey.

But while this scenario was intentionally designed to invoke a certain behavior, that doesn't mean the same behavior can't be invoked unintentionally in the wild.

[1] https://www.nytimes.com/2023/02/16/technology/bing-chatbot-m...

literalAardvark a day ago | parent

Even worse, when you do RLHF the behaviours out, the model becomes psychotic.

This is gonna be an interesting couple of years.