wongarsu 12 hours ago

My suspicion is that this agreeableness is an inherent issue with doing RLHF.

For a human taking a test, knowing what the grader wants to hear matters more than knowing the objectively correct answer, and with a bad grader there can be a big difference between the two. For humans that is not catastrophic, because we can easily tell a testing environment from a real one and adjust our behavior accordingly. When asking how to answer a question it's not unusual to hear "The real answer is X, but on the test just write Y".

Now LLMs have the same issue during RLHF. The specifics are obviously different, with humans being sentient and LLMs being trained by backpropagation. But at a high level the LLM is still trained to produce whatever answer the human feedback rewards, which is not always the objectively correct one. And because a large number of humans are involved, the LLM has to guess what a given rater wants to hear from the only information it has: the prompt. And since we actively don't want the LLM to behave differently in training and in deployment, you get this teacher-pleasing behavior all the time.
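
To make that concrete, here's a rough sketch of the pairwise preference objective that reward models are typically trained with (a toy stand-in, not any particular lab's training stack). Notice that the loss only cares which answer the rater preferred; "objectively correct" appears nowhere in the objective.

```python
# Toy sketch of Bradley-Terry-style reward model training, as used in typical
# RLHF pipelines. The "reward model" here is a bag-of-tokens toy, not an LLM;
# the point is the shape of the objective, which only rewards ranking the
# rater-preferred answer above the rejected one.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # bag-of-tokens "encoder"
        self.score = nn.Linear(dim, 1)                 # scalar reward head

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of prompt+answer tokens
        return self.score(self.embed(token_ids)).squeeze(-1)

def preference_loss(model, chosen_ids, rejected_ids):
    # Pairwise logistic loss: maximize log sigmoid(r(chosen) - r(rejected)).
    # The model is trained purely to score the answer the rater picked higher.
    r_chosen = model(chosen_ids)
    r_rejected = model(rejected_ids)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyRewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Fake batch: token ids standing in for (prompt + chosen) and (prompt + rejected).
    chosen = torch.randint(0, 1000, (8, 20))
    rejected = torch.randint(0, 1000, (8, 20))

    loss = preference_loss(model, chosen, rejected)
    loss.backward()
    opt.step()
    print(f"pairwise loss: {loss.item():.4f}")
```

The policy model is then tuned to maximize that learned reward, so anything that systematically pleases raters gets amplified.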

So maybe it's not completely inherent to RLHF, but rather to RLHF where the person making the query is the same person scoring the answer, or where the two are closely aligned. But that's true of all the "crowd-sourced" RLHF where regular users get two answers to their question and choose the better one.
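
For illustration, a crowd-sourced preference record ends up looking roughly like this (field names invented by me); the important part is that the asker and the rater are the same account by construction:

```python
# Hypothetical shape of a crowd-sourced preference record (illustrative only).
# In the "pick the better of two answers" UI, whoever asked the question is
# also the person scoring the answers.
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    prompt: str
    answer_a: str
    answer_b: str
    preferred: str   # "a" or "b", as clicked by the user
    asker_id: str
    rater_id: str    # == asker_id in crowd-sourced collection

record = PreferenceRecord(
    prompt="Is my business plan any good?",
    answer_a="It has serious gaps in X and Y.",
    answer_b="Great plan! A few small tweaks and you're set.",
    preferred="b",          # the answer the asker liked hearing
    asker_id="user_123",
    rater_id="user_123",    # same person asks and scores
)
```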