gwern 3 days ago
I think this is a little paranoid. No one is training extremely large, expensive LLMs on huge datasets in the hope that a blogger will stumble across poor 1800-Elo performance and tweet about it! 'Chess' is not a standard LLM benchmark worth Goodharting. OA has generally tried to solve problems the right way rather than by shortcuts & cheating, and the GPTs have not heavily overfit on the standard benchmarks or counterexamples, which they so easily could have and which would be far more valuable PR (imagine how trivial it would be to train on, say, 'the strawberry problem'); by contrast, some other LLM providers do see their scores drop much more in anti-memorization papers. They also have a clear research use of their own for the dataset, in that very paper mentioning it. And there is some interest in chess as a model organism of supervision and world-modeling in LLMs, because we have access to oracles (and it's less boring than many things you could analyze), which explains why they would be doing some research (if not a whole lot). Like the bullet-chess LLM paper from DeepMind: they aren't doing that as part of a cunning plan to make Gemini cheat at chess and help GCP marketing!