Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.

They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.

I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board before it refused to play with me any longer.

I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.

▲

JFingleton 3 months ago | parent | next [-]

I'm not sure why people are expecting a language model to be great at chess. Remember they are trained on text, which is not the best medium for representing things like a chess board. They are also "general models", with limited training on pretty much everything apart from human language.

An Alpha Star type model would wipe the floor at chess.

▲

actsasbuffoon 3 months ago | parent | next [-]

This misses the point. LLMs will do things like move a knight by a single square as if it were a pawn. Chess is an extremely well understood game, and the rules about how things move is almost certainly well-represented in the training data.

These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of task are never going to be possible for LLMs unless that changes. Programming is one of those tasks.

▲

og_kalu 3 months ago | parent | next [-]

>These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding.

Yeah they can. There's a link I shared to prove it which you've conveniently ignored.

LLMs learn by predicting, failing and getting a little better, rinse and repeat. Pre-training is not like reading a book. LLMs trained on chess games play chess just fine. They don't make the silly mistakes you're talking about and they very rarely make illegal moves.

There's gpt-3.5-turbo-instruct which i already shared and plays at around 1800 ELO. Then there's this grandmaster level chess transformer - https://arxiv.org/abs/2402.04494. They're also a couple of models that were trained in the Eleuther AI discord that reached about 1100-1300 Elo.

I don't know what the peak of LLM Chess playing looks like but this is clearly less of a 'LLMs can't do this' problem and more 'Open AI/Anthropic/Google etc don't care if their models can play Chess or not' problem.

So are they capable of reasoning now or would you like to shift the posts ?

▲

int_19h 3 months ago | parent [-]

I think the point here is that if you have to pretrain it for every specific task, it's not artificial general intelligence, by definition.

▲

og_kalu 3 months ago | parent [-]

There isn't any general intelligence that isn't receiving pre-traning. People spend 14 to 18+ years in school to have any sort of career.

You don't have to pretrain it for every little thing but it should come as no surprise that a complex non-trivial game would require it.

Even if you explained all the rules of chess clearly to someone brand new to it, it will be a while and lots of practice before they internalize it.

And like I said, LLM pre-training is less like a machine reading text and more like Evolution. If you gave a corpus of chess rules, you're only training a model that knows how to converse about chess rules.

Do humans require less 'pre-training' ? Sure, but then again, that's on the back of millions of years of evolution. Modern NNs initialize random weights and have relatively very little inductive bias.

▲

sceptic123 3 months ago | parent [-]

People are focussing on chess, which is complicated, but LLM fail at even simple games like tic-tac-toe where you'd think, if it was capable of "reasoning" it would be able to understand where it went wrong. That doesn't seem to be the case.

What it can do is write and execute code to generate the correct output, but isn't that cheating?

▲

int_19h 3 months ago | parent [-]

Which SOTA LLM fails at tic-tac-toe?

	▲	sceptic123 3 months ago \| parent [-]
		I don't know, but it's not a hard test, get the LLM to play a perfect game of tic-tac-toe against itself, look at the output and see if it goes wrong.

▲

simonw 3 months ago | parent | prev [-]

Saying programming is a task that is "never going to be possible" for an LLM is a big claim, given how many people have derived huge value from having LLMs write code for them over the past two years.

(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)

▲

sceptic123 3 months ago | parent [-]

I think "useful as an assistant for coding" and "being able to program" are two different things.

When I was trying to understand what is happening with hallucination GPT gave me this: > It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.

From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no-one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.

Until/unless they are able to understand the output enough to verify the truth then there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.

▲

simonw 3 months ago | parent [-]

Code is one of the few applications of LLMs where they DO have a mechanism for verifying if what they produced is correct: they can write code, run that code, look at the output and iterate in a loop until it does what it's supposed to do.

	▲	sceptic123 3 months ago \| parent [-]
		But that requires code that is runnable and testable in isolation otherwise there are all sorts issues with that approach (aside from the obvious one of scalability) It also assumes they "understand" enough to be able to extract the correct output to test against.

▲

wijwp 3 months ago | parent | prev [-]

> I'm not sure why people are expecting a language model to be great at chess.

Because the conversation is about AGI, and how far away we are from AGI.

	▲	megablast 3 months ago \| parent [-]
		Does AGI mean good at chess? What if it is a dumb AGI?

▲

schindlabua 3 months ago | parent | prev | next [-]

Chess is not exactly a simple logic task. It requires you to keep track of 32 things in a 2d space.

I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in it's head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.

It really depends on how much you want to shift the goalposts on what constitutes "simple".

	▲	yusina 3 months ago \| parent [-]
		> Chess is not exactly a simple logic task. Compare to what a software engineer is able to do, it is very much a simple logic task. Or the average person having a non-trivial job. Or a beehive organizing its existence, from its amino acids up to hive organization. All those things are magnitudes harder than chess. > I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in it's head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground. It's not reasoning its way there. Somebody asked something similar some time in the corpus and that corpus also contained the answers. That's why it can answer. After a quite small number of moves, the chess board it unique and you can't fake it. You need to think ahead. A task which computers are traditionally very good at. Even trained chess players are. That LLMs are not goes to show that they are very far from AGI.

▲

code_biologist 3 months ago | parent | prev | next [-]

Claude can't beat Pokemon Red. Not even close yet: https://arstechnica.com/ai/2025/03/why-anthropics-claude-sti...

▲

og_kalu 3 months ago | parent | prev [-]

LLMs can play chess fine.

The best model you can play with is decent for a human - https://github.com/adamkarvonen/chess_gpt_eval

SOTA models can't play it because these companies don't really care about it.