moffkalast a day ago
Well, if lmsys showed anything, it's that human judges are measurably worse. Then you have your run-of-the-mill multiple-choice tests that grade models on unrealistic single-token outputs. What does that leave us with?
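(For concreteness, "single-token output" grading means something like the sketch below: score only the one token the model would emit after "Answer:" and count the argmax letter as its answer. A minimal sketch using a Hugging Face causal LM; the model name and prompt format are placeholders, not any specific benchmark's exact setup.)

  # Minimal sketch of single-token multiple-choice grading (MMLU-style).
  # The model and prompt format are illustrative, not a real benchmark's.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "gpt2"  # stand-in model; real evals use much larger ones
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

  question = (
      "Question: What is 2 + 2?\n"
      "A. 3\nB. 4\nC. 5\nD. fish\n"
      "Answer:"
  )
  choices = [" A", " B", " C", " D"]

  # Score only the single next token after "Answer:"; whichever option
  # letter has the highest logit counts as the model's answer.
  with torch.no_grad():
      input_ids = tok(question, return_tensors="pt").input_ids
      next_token_logits = model(input_ids).logits[0, -1]

  choice_ids = [tok(c).input_ids[0] for c in choices]
  scores = [next_token_logits[i].item() for i in choice_ids]
  predicted = choices[scores.index(max(scores))].strip()
  print(predicted)  # graded right or wrong on this one token alone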
ranyume 19 hours ago
> What does that leave us with?

Back at the start, with no benchmark. Because LLMs can't reason at this point, because we don't have a reliable way of grading LLM reasoning, and because people stubbornly insist that LLMs are actually reasoning, we're back where we began.

When you ask an LLM "2 + 2 = ", it doesn't add the numbers together; it looks up one of the stories it memorized and returns what happens next. In some of those stories, 2 + 2 probably equals fish. Similarly, when you ask an LLM to grade another LLM, it's just looking up what happens next in its stories, not following instructions. Following instructions requires thinking, so it isn't really following them at all.

You could instead say you're commanding the LLM, or programming it, in which case you bear full responsibility for what it produces and the LLM has no authorship. Put another way, the LLM cannot make something you yourself couldn't... at least not while it can't reason.
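(To make "an LLM grading another LLM" concrete, here is a minimal sketch of the usual judge setup: hand the judge model a scoring prompt and parse whatever number it continues with. It uses the OpenAI Python client; the model name, rubric wording, and parsing are illustrative assumptions, not any particular eval harness.)

  # Minimal sketch of an LLM-as-judge loop: the judge continues a scoring
  # prompt, and the continuation is parsed as a grade. Model name and
  # rubric are illustrative assumptions.
  import re
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def judge(question: str, answer: str) -> int:
      prompt = (
          "You are grading an answer for correctness and clarity.\n"
          f"Question: {question}\n"
          f"Answer: {answer}\n"
          "Reply with a single integer score from 1 to 10."
      )
      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # illustrative judge model
          messages=[{"role": "user", "content": prompt}],
      )
      text = resp.choices[0].message.content or ""
      match = re.search(r"\d+", text)
      # Whatever number the judge happens to emit becomes the "grade".
      return int(match.group()) if match else 0

  print(judge("What is 2 + 2?", "fish"))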
sbierwagen a day ago
Seems like a foreshock of AGI if the average human is no longer good enough to give feedback directly, and the nets instead have to handle recursive self-improvement themselves.