emp17344 18 hours ago
That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.
famouswaffles 18 hours ago
Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to have regressed. The problem is that people are more interested in validating their own assumptions than anything else. https://arxiv.org/abs/2403.15498
dwohnitmok 16 hours ago
> That’s a devastating benchmark design flaw

I think the parent simply missed, until their later reply, that the benchmark includes rated engines.
runarberg 18 hours ago
I like this game between grok-4.1-fast and maia-1100 (an engine, not an LLM): https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig’s The Royal Game, where the protagonist survived Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book btw; I am aware I just committed Godwin’s law here, and of the irony in that). The protagonist became “good” at chess simply by memorizing a lot of games.
| ||||||||