Remix.run Logo
moffkalast 18 hours ago

That's kind of nonsense, since if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school. Doing the math on paper is tool use, which models can easily do too if you give them the option, writing adhoc python scripts to run the math you ask them to with exact results. There is definitely a lot of generalization going on beyond just pattern matching, otherwise practically nothing of what everyone does with LLMs daily would ever work. Although it's true that the patterns drive an extremely strong bias.

Arguably if you're grading LLM output, which by your definition cannot be novel, then it doesn't need to be graded with something that can. The gist of this grading approach is just giving them two examples and asking which is better, so it's completely arbitrary, but the grades will be somewhat consistent and running it with different LLM judges and averaging the results should help at least a little. Human judges are completely inconsistent.

ranyume 18 hours ago | parent [-]

> if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school

Memorization is one ability people have, but it's not the only one. In the case of LLMs, it's the only ability it has.

Moreover, let's make this clear: LLMs do not memorize the same way people do, they don't memorize the same concepts people do, and they don't memorize the same content people do. This is why LLMs "have hallucinations", "don't follow instructions", "are censored", and "makes common sense mistakes" (these are words people use to characterize LLMs).

> nothing of what everyone does with LLMs daily would ever work

It "works" in the sense that the LLM's output serves a purpose designated by the people. LLMs "work" for certain tasks and don't "work" for others. "Working" doesn't require reasoning from an LLM, any tool can "work" well for certain tasks when used by the people.

> averaging the results should help at least a little

Averaging the LLM grading just exacerbates the illusion of LLM reasoning. It only confuses people. Would you ask your hammer to grade how well scissors cut paper? You could do that, and the hammer would say it gets the job done but doesn't cut well because it needs to smash the paper instead of cutting it; Your hammer's just talking in a different language. It's the same here. The LLMs output doesn't necessarily measure what the instructions in the prompt say.

> Human judges are completely inconsistent.

Humans can be inconsistent, but how well the LLM adapts to humans is itself a metric of success.