| ▲ | gAI 4 hours ago | |
Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona. https://www.anthropic.com/research/persona-selection-model https://www.anthropic.com/research/assistant-axis https://www.anthropic.com/research/emergent-misalignment-rew... https://www.anthropic.com/research/emotion-concepts-function | ||
| ▲ | hashmap 2 hours ago | |
RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space, I don't want to hire Igor to explore it with me. It's more helpful to have a colleague role who will sort of jump out and say "nah, that's dumb, what if we throw out that whole thing and try a completely different angle instead". | ||