AIPedant 3 days ago

I am less interested in questioning training data corruption than I am in questioning claims like this:

  test reasoning abilities such as pattern recognition, lateral thinking, abstraction, contextual reasoning (accounting for British cultural references), and multi-step inference. ... [With] its emphasis on clever reasoning rather than knowledge recall, Only Connect provides an ideal challenge for benchmarking LLMs' reasoning capabilities.
It seems to me that the null hypothesis should be "LLMs are probabilistic next-word generators and might be able to solve much of this with shallow surface statistics built from inhumanly large datasets, without ever properly using abstraction, contextual reasoning, etc." This is particularly true for NYT Connections, but in general, evaluations like this seem to be at least partially testing how amenable certain word/trivia games are to naive statistical algorithms. (Many NYT Connections "purple" categories seem like they would be quite obvious to a next-n-gram calculator, but not to people who actually use words conversationally!) Humans don't use these statistical algorithms for reasoning except in particular circumstances: many people use "folk n-gram statistics" when playing Wordle; serious poker and word-game players often memorize detailed tables of frequencies and odds; you could imagine competitive NYT Connections players learning a giant bag of statistical heuristics to help them speedrun puzzles. We just can't accumulate the data ourselves without making a concerted computer-aided effort.
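
To make "shallow surface statistics" concrete, here is a toy sketch in Python (entirely my own illustration; the three-line corpus and the candidate groups are invented) of a "solver" that ranks candidate four-word groups by raw co-occurrence counts alone:

  # Toy sketch: rank candidate Connections-style groups by nothing but
  # corpus co-occurrence counts. Corpus, words, and groups are invented.
  from collections import Counter
  from itertools import combinations

  corpus = [
      "pop star rock star film star",
      "rock music pop music folk music",
      "movie star movie night movie set",
  ]

  # Count how often two words appear in the same line of the corpus.
  cooccur = Counter()
  for line in corpus:
      for pair in combinations(sorted(set(line.split())), 2):
          cooccur[pair] += 1

  def group_score(words):
      """Sum pairwise co-occurrence counts: a purely surface-level signal."""
      return sum(cooccur[tuple(sorted(p))] for p in combinations(words, 2))

  # The "solver" just picks the four-word subset with the highest score --
  # no abstraction or lateral thinking, only collocation frequency.
  print(group_score(["pop", "rock", "folk", "music"]))  # scores high
  print(group_score(["pop", "rock", "film", "night"]))  # scores low

Nothing in that scorer models meaning, yet scaled up to a web-sized corpus it would plausibly nail exactly the categories whose members habitually collocate.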

In general, a lot of LLM benchmarks don't adequately consider that LLMs can solve certain things better than humans without using reasoning or knowledge. The stupidest example is how common multiple-choice benchmarks are, despite all of us learning as children that multiple-choice questions can be partially gamed with shallow statistical-linguistic tricks even when you have no clue how to answer honestly[1]. It stands to reason that a superhuman statistical-linguistic computer could accumulate superhuman statistical-linguistic tricks without ever properly learning the subject matter. AI folks have always been quick to say "if it quacks like a duck, it reasons like a duck," but these days computers are quite good at playing duck recordings.

[1] "When in doubt, C your way out," sniffing out suspicious answers, shallow pattern-matching to answer reading comprehension, etc etc. One thing humans and LLMs actually do have in common is that multiple-choice tests are terrible ways to assess their knowledge or intelligence.