> if I ask you what's five times six, you don't do the math in your head, you spit out the value of the multiplication table you memorized in primary school
Memorization is one ability people have, but it's not the only one. In the case of LLMs, it's the only ability they have.
Moreover, let's make this clear: LLMs do not memorize the same way people do, they don't memorize the same concepts people do, and they don't memorize the same content people do. This is why LLMs "have hallucinations", "don't follow instructions", "are censored", and "make common sense mistakes" (these are the words people use to characterize LLMs).
> nothing of what everyone does with LLMs daily would ever work
It "works" in the sense that the LLM's output serves a purpose designated by the people. LLMs "work" for certain tasks and don't "work" for others. "Working" doesn't require reasoning from an LLM, any tool can "work" well for certain tasks when used by the people.
> averaging the results should help at least a little
Averaging the LLM grading just exacerbates the illusion of LLM reasoning. It only confuses people. Would you ask your hammer to grade how well scissors cut paper? You could do that, and the hammer would say it gets the job done but doesn't cut well, because it needs to smash the paper instead of cutting it. Your hammer is just talking in a different language. It's the same here: the LLM's output doesn't necessarily measure what the instructions in the prompt say.
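To make concrete what "averaging the results" amounts to, here's a minimal Python sketch of the judge-averaging setup being discussed (`query_llm` is a hypothetical stand-in for whatever model call is actually used; the simulated scores are made up for illustration). Averaging repeated judge runs makes the number more stable, but the number still only reflects the judge's output distribution, not whatever the prompt claims it measures.

```python
# Minimal sketch of the "average the judge scores" setup under discussion.
# query_llm is a hypothetical stand-in for an actual model call.
import random
import statistics

def query_llm(prompt: str) -> float:
    # Hypothetical judge call: returns a 1-10 "quality" score.
    # Simulated here as noise around a fixed value to make the point:
    # averaging reduces the noise, but the 7.0 itself was never validated.
    return max(1.0, min(10.0, random.gauss(7.0, 1.5)))

def averaged_judge_score(answer: str, n_runs: int = 10) -> float:
    prompt = f"Rate the following answer from 1 to 10 for correctness:\n{answer}"
    scores = [query_llm(prompt) for _ in range(n_runs)]
    # The mean is more stable run to run, which *looks* more trustworthy,
    # but it only smooths the judge's outputs; it doesn't check correctness.
    return statistics.mean(scores)

if __name__ == "__main__":
    print(f"averaged score: {averaged_judge_score('some model answer'):.2f}")
```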
> Human judges are completely inconsistent.
Humans can be inconsistent, but how well the LLM adapts to humans is itself a metric of success.