ranyume, a day ago:
> What does that leave us with? At the start, with no benchmark. Because LLMs can't reason at this time, because we don't have a reliable way of grading LLM reasoning, and because people stubbornly think LLMs are actually reasoning, we're at the start. When you ask an LLM "2 + 2 = ", it doesn't add the numbers together; it just looks up one of the stories it memorized and returns what happens next. Probably in some such stories, 2 + 2 = fish. Similarly, when you ask an LLM to grade another LLM, it's just looking up what happens next in its stories, not following instructions. "Following" instructions requires thinking, so it isn't even following instructions. But you can say you're commanding the LLM, or programming the LLM, so you have full responsibility for what the LLM produces, and the LLM has no authorship. Put another way, the LLM cannot make anything you yourself couldn't... at this point, while it can't reason.
stevenhuang, 17 hours ago:
You have an outmoded understanding of how LLMs work (flawed in ways that are "not even wrong"), a poor ontological understanding of what reasoning even is, and too much certainty that your answers to open questions are the right ones.
| ||||||||
moffkalast, 18 hours ago:
That's kind of nonsense: if I ask you what five times six is, you don't do the math in your head, you recall the value from the multiplication table you memorized in primary school. Doing the math on paper is tool use, which models can easily do too if you give them the option, writing ad hoc Python scripts to compute exact results for the math you ask about. There is clearly a lot of generalization going on beyond pure pattern matching, otherwise practically nothing of what people do with LLMs daily would ever work, although it's true that the patterns impose an extremely strong bias.

Arguably, if you're grading LLM output, which by your definition cannot be novel, then it doesn't need to be graded by something that can be. The gist of this grading approach is just giving a model two examples and asking which is better. That's arbitrary, but the grades will be somewhat consistent, and running it with different LLM judges and averaging the results should help at least a little. Human judges are completely inconsistent.
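For what it's worth, the multi-judge averaging idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not anyone's actual implementation: the two judge functions are stand-ins for real LLM API calls, each returning a binary preference for answer A over answer B, and the panel score is just their mean.

```python
import statistics

# Hypothetical judges standing in for real LLM calls; each returns
# 1.0 if it prefers answer A over answer B, else 0.0.
def judge_strict(a: str, b: str) -> float:
    # Toy heuristic: prefers the more concise answer.
    return 1.0 if len(a) <= len(b) else 0.0

def judge_lenient(a: str, b: str) -> float:
    # Toy heuristic: prefers A as long as it is non-empty.
    return 1.0 if a.strip() else 0.0

def pairwise_grade(answer_a: str, answer_b: str, judges) -> float:
    """Average the binary preferences of several judges.

    A score near 1.0 means the panel prefers answer A; near 0.0,
    answer B. Averaging over judges smooths out any single judge's
    idiosyncratic bias, which is the point of the approach above."""
    votes = [judge(answer_a, answer_b) for judge in judges]
    return statistics.mean(votes)

score = pairwise_grade("4", "2 + 2 = fish", [judge_strict, judge_lenient])
```

Here both toy judges prefer "4", so the panel score is 1.0; with real LLM judges the interesting cases are the ones where they disagree and the average lands somewhere in between.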
| ||||||||