> if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school
Memorization is one ability people have, but it's not the only one. In the case of LLMs, it's the only ability they have.
Moreover, let's make this clear: LLMs do not memorize the same way people do, they don't memorize the same concepts people do, and they don't memorize the same content people do. This is why LLMs "have hallucinations", "don't follow instructions", "are censored", and "make common sense mistakes" (these are the words people use to characterize LLMs).
> nothing of what everyone does with LLMs daily would ever work
It "works" in the sense that the LLM's output serves a purpose designated by the people. LLMs "work" for certain tasks and don't "work" for others. "Working" doesn't require reasoning from an LLM, any tool can "work" well for certain tasks when used by the people.
> averaging the results should help at least a little
Averaging the LLM grading just exacerbates the illusion of LLM reasoning. It only confuses people. Would you ask your hammer to grade how well scissors cut paper? You could do that, and the hammer would say it gets the job done but doesn't cut well, because it needs to smash the paper instead of cutting it. Your hammer is just talking in a different language. It's the same here: the LLM's output doesn't necessarily measure what the instructions in the prompt say.
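To make concrete what "averaging the results" amounts to, here's a minimal Python sketch of the judge-averaging setup being discussed (`query_llm` is a hypothetical stand-in for whatever model call is actually used; the simulated scores are made up for illustration). Averaging repeated judge runs makes the number more stable, but the number still only reflects the judge's output distribution, not whatever the prompt claims it measures.

```python
# Minimal sketch of the "average the judge scores" setup under discussion.
# query_llm is a hypothetical stand-in for an actual model call.
import random
import statistics

def query_llm(prompt: str) -> float:
    # Hypothetical judge call: returns a 1-10 "quality" score.
    # Simulated here as noise around a fixed value to make the point:
    # averaging reduces the noise, but the 7.0 itself was never validated.
    return max(1.0, min(10.0, random.gauss(7.0, 1.5)))

def averaged_judge_score(answer: str, n_runs: int = 10) -> float:
    prompt = f"Rate the following answer from 1 to 10 for correctness:\n{answer}"
    scores = [query_llm(prompt) for _ in range(n_runs)]
    # The mean is more stable run to run, which *looks* more trustworthy,
    # but it only smooths the judge's outputs; it doesn't check correctness.
    return statistics.mean(scores)

if __name__ == "__main__":
    print(f"averaged score: {averaged_judge_score('some model answer'):.2f}")
```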
> Human judges are completely inconsistent.
Humans can be inconsistent, but how well the LLM adapts to humans is itself a metric of success.