sireat 11 hours ago
An easy one is to provide a middle-game chess position (it could be an image, standard notation, or even some less standard notation) and ask the model to evaluate it and suggest some moves. Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice), it will not do well. I am a reasonably skilled player (FM), so I can evaluate far better than LLMs, and I imagine even advanced beginners could tell the LLM is talking nonsense about chess after a few prompts.

Of course, playing chess is not what LLMs are good at, but it goes to show that LLMs are not a full path to AGI. The beauty of using chess positions is that leaking your prompts into LLM training sets is no concern, because you just use a new position each time. There is little worry of running out of positions...
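A minimal sketch of such a probe, assuming the python-chess library and a hypothetical ask_llm() helper standing in for whichever model is being tested (the FEN below is just an illustrative middle-game position):

    import chess

    def ask_llm(prompt: str) -> str:
        # Hypothetical helper: wire this up to the model under test.
        raise NotImplementedError

    def probe(fen: str) -> None:
        board = chess.Board(fen)
        prompt = (
            f"Here is a chess position in FEN: {fen}\n"
            "Evaluate it and reply with your suggested move in SAN, nothing else."
        )
        reply = ask_llm(prompt)
        try:
            # Weakest possible check: is the suggestion even a legal move?
            move = board.parse_san(reply.strip())
            print("Legal suggestion:", board.san(move))
        except ValueError:
            print("Illegal or unparseable move:", repr(reply))

    probe("r1bq1rk1/pp2bppp/2n1pn2/2pp4/3P1B2/2P1PN2/PP1N1PPP/R2QKB1R w KQ - 0 8")

Legality is only a floor, of course; judging whether the suggested move is any good still takes an engine or a strong human, as the comment says.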
rmorey 2 hours ago | parent | next [-]
I was going to suggest chess position recognition from images; AFAIK it's still an unsolved computer vision task. (Once a position is recognized, I think analysis is well solved by, say, a Stockfish tool the LLM can call, though there is interesting work going on with language models themselves understanding chess.)
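For the analysis half, a rough sketch of what a Stockfish tool for the LLM could look like, assuming python-chess and a local Stockfish binary (the path is a placeholder):

    import chess
    import chess.engine

    STOCKFISH_PATH = "/usr/bin/stockfish"  # placeholder; point at your own binary

    def analyse_fen(fen: str, depth: int = 18) -> dict:
        """Return Stockfish's preferred move and score for the side to move."""
        board = chess.Board(fen)
        with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            best = info["pv"][0]
            score_cp = info["score"].pov(board.turn).score(mate_score=100_000)
        return {"best_move": board.san(best), "score_cp": score_cp}

    # The returned dict is what an LLM tool call would get back.
    print(analyse_fen("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"))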
helloplanets 10 hours ago | parent | prev [-]
I wonder how much fine-tuning against something like Stockfish top moves would help a model solve novel middle-game positions. Something in this format: https://database.lichess.org/#evals

I'd be pretty surprised if it helped in novel positions, which would make this an interesting LLM benchmark, honestly: beating Stockfish from random (but equal) middle-game positions, or, to mix it up, from random Chess960 positions.

Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals, so a dataset based on Stockfish evals would seem completely insufficient for beating Stockfish.
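A rough harness for that benchmark idea, assuming python-chess, a local Stockfish binary (placeholder path), and a hypothetical llm_pick_move() that wraps the model being evaluated:

    import random
    import chess
    import chess.engine

    STOCKFISH_PATH = "/usr/bin/stockfish"  # placeholder

    def llm_pick_move(board: chess.Board) -> chess.Move:
        # Hypothetical: prompt the model with board.fen() and parse its reply.
        # A random legal move keeps the sketch runnable on its own.
        return random.choice(list(board.legal_moves))

    def play_game(start_id: int, engine: chess.engine.SimpleEngine) -> str:
        board = chess.Board.from_chess960_pos(start_id)  # random but equal start
        llm_is_white = bool(random.getrandbits(1))
        while not board.is_game_over():
            if (board.turn == chess.WHITE) == llm_is_white:
                board.push(llm_pick_move(board))
            else:
                board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)
        result = board.result()
        if result == "1/2-1/2":
            return "draw"
        return "llm" if (result == "1-0") == llm_is_white else "engine"

    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
        outcomes = [play_game(random.randrange(960), engine) for _ in range(10)]
    print({k: outcomes.count(k) for k in ("llm", "draw", "engine")})

Random middle-game starts would work the same way, just with the Chess960 opening swapped for positions sampled from a game database and filtered to roughly equal engine evals.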