antirez 7 days ago

The post is based on a misconception. If you read the blog post linked at the end of this message, you'll see how a very small GPT-2-like transformer (Karpathy's nanoGPT trained at a very small size), after seeing just PGN games and nothing more, develops an internal 8x8 representation of which chess piece is where. This representation can be extracted by linear probing (and can even be altered by using the probe in reverse). LLMs are decent but not very good chess players for other reasons, not because they lack a world model of the chess board.

https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gp...
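A minimal sketch of the kind of linear probe the linked post describes, using stand-in random data (the array names, shapes, and data here are hypothetical; the real experiment probes actual hidden states from the chess-trained model):

```python
# Linear-probe sketch (hypothetical shapes, random stand-in data, for illustration only):
# fit one linear classifier per board square to predict which piece occupies it,
# given hidden activations from a transformer that has only read PGN move sequences.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_positions, d_model = 5000, 512        # assumed: number of sampled positions, model width
n_squares, n_piece_classes = 64, 13     # 12 piece types + "empty"

rng = np.random.default_rng(0)
activations = rng.normal(size=(n_positions, d_model))                            # stand-in for real hidden states
board_labels = rng.integers(0, n_piece_classes, size=(n_positions, n_squares))   # stand-in square labels

# One probe per square; high held-out accuracy on *real* activations is the evidence
# that board state is linearly decodable from the model's internal representation.
probes = []
for sq in range(n_squares):
    clf = LogisticRegression(max_iter=200)
    clf.fit(activations[:4000], board_labels[:4000, sq])
    probes.append(clf)

mean_acc = np.mean([p.score(activations[4000:], board_labels[4000:, sq])
                    for sq, p in enumerate(probes)])
print(f"mean per-square probe accuracy: {mean_acc:.3f}")  # ~chance here; the claim is it is high on real data
```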

thecupisblue 7 days ago | parent | next [-]

Ironically, that lesswrong article is more wrong than right.

First, chess is perfect for such modeling. The game is basically a tree of legal moves. The "world model" representation is already encoded in the dataset itself, and at a certain scale the chance of making an illegal move is minimal, as the dataset contains an overwhelming number of legal moves compared to illegal ones, let alone when you are training on a chess dataset like a PGN one.
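To make the "tree of legal moves" point concrete, here is a small sketch (it assumes the third-party python-chess package) that counts positions reachable from the start; a PGN corpus is, in effect, a large sample of paths through this tree, so legal continuations dominate the training signal:

```python
# Chess as a tree of legal moves (assumes: pip install python-chess).
import chess

def count_positions(board: chess.Board, depth: int) -> int:
    """Count leaf positions reachable in `depth` plies (a tiny perft)."""
    if depth == 0:
        return 1
    total = 0
    for move in board.legal_moves:
        board.push(move)
        total += count_positions(board, depth - 1)
        board.pop()
    return total

print(count_positions(chess.Board(), 2))  # 400 positions after one move by each side
```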

Second, the probing is quite... a subjective thing.

We are cherry-picking activations across an arbitrary number of dimensions, on a model specifically trained for chess, taking these arbitrary representations and displaying them on a 2D graph.

Well yeah, with enough dimensions and enough cherry-picking, we can also "show" that all zebras are elephants, because all elephants are horses and look, their weights overlap in so many dimensions: large four-legged animals you see on safari! Especially if we cherry-pick the dimensions, and especially if we tune a dataset for it.

This shows nothing other than "training LLMs on a constrained move dataset makes LLM great at predicting next move in that dataset".
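A toy illustration of that cherry-picking worry, on purely synthetic data (nothing here comes from any real probing study): with many dimensions to choose from, one can select a 2D view in which two unrelated groups of random vectors look as similar as you like.

```python
# Cherry-picking a projection (synthetic data only): pick the two raw dimensions
# where two unrelated random groups have the most similar means, then "report" that view.
import numpy as np

rng = np.random.default_rng(1)
d = 128
zebras = rng.normal(size=(200, d))
elephants = rng.normal(size=(200, d))

best = min(
    ((i, j) for i in range(d) for j in range(i + 1, d)),
    key=lambda ij: np.linalg.norm(zebras[:, list(ij)].mean(0) - elephants[:, list(ij)].mean(0)),
)
gap = np.linalg.norm(zebras[:, list(best)].mean(0) - elephants[:, list(best)].mean(0))
print("cherry-picked dimensions:", best)
print(f"mean gap in that 2D view: {gap:.4f}")  # tiny, despite the groups being unrelated
```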

flender 7 days ago | parent [-]

And if it knew every possible board configuration and the optimal move for each, it could potentially play as well as possible. But if instead it just recognized "this looks like a chess game" and used an optimized tool to determine the next move, that would seem a better use of training.

thecupisblue 7 days ago | parent [-]

A way better use; at that point the engine is more like the world's most expensive Monte Carlo search.

yosefk 7 days ago | parent | prev [-]

The post, or rather the part you refer to, is based on a simple experiment which I encourage you to repeat. (It is far likelier to reproduce in the short to medium run than the others.)

From your link: "...The first was gpt-3.5-turbo-instruct's ability to play chess at 1800 Elo"

These things don't play at 1800 Elo, though maybe someone measured that Elo without cheating but rather by relying on artifacts of how an engine told to play at a low rating performs against an LLM (engines are weird when you ask them to play badly, as a rule). A good start to a decent measurement would be to try it on Chess960. These things do lose track of the pieces within 10 moves. (As do I, absent a board to look at, but I understand enough to say "I can't play blindfold chess, let's set things up so I can look at the current position somehow.")
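A small sketch of how one might repeat the "loses track of the pieces" check: have the model propose moves for both sides and validate each one. Here `ask_llm_for_move` is a hypothetical stand-in for whatever model API you use, and python-chess is assumed for legality checking.

```python
# Sketch for reproducing the "loses track of pieces" experiment.
# `ask_llm_for_move` is a hypothetical placeholder: given the SAN move list so far,
# it should return the model's proposed next move in SAN.
import chess

def ask_llm_for_move(move_history_san: list[str]) -> str:
    raise NotImplementedError("call your LLM of choice here")

board = chess.Board()
history: list[str] = []
for ply in range(40):  # 20 moves per side
    san = ask_llm_for_move(history)
    try:
        move = board.parse_san(san)  # raises ValueError on illegal/unparsable moves
    except ValueError:
        print(f"illegal or unparsable move at ply {ply}: {san!r}")
        break
    board.push(move)
    history.append(san)
else:
    print("40 plies without an illegal move")
```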

og_kalu 7 days ago | parent [-]

>These things don't play at 1800 ELO

Why are you saying "these things"? That statement is about a specific model which did play at that level and did not lose track of the pieces. There was no cheating or weirdness involved.

https://github.com/adamkarvonen/chess_gpt_eval