Remix.run Logo
krackers 7 hours ago

>They can't be emergent generation, surely

It is. It's what you get when you RLHF for catchy, agreeable, enthusiastic responses. The content doesn't matter, it's the "style" that becomes applied like a coat of paint over anything. That's how you end up with the corpspeak-esque yet chilling sentences mentioned in https://news.ycombinator.com/item?id=45845871

What would be nice is for OpenAI to do a retrospective here and perform some interoperability research. Does the LLM even "realize" (in the sense of the residual stream encoding those concepts) that it is encouraging suicide? I'd almost hypothesize that the process of RLHF'ing and selecting for sycophancy diminishes those circuits, effectively lobotomizing the LLM (much like safety training does) so it responds only to the shallow immediate context, missing the forest for the trees.