Remix.run Logo
photonthug 3 days ago

Thanks for your thoughtful comment and refs to chase down.

> You just made up a definition of "understand". According to that definition, you are of course right. I just don't think it's a reasonable definition. ... Here are things people say:

Fine. As others have pointed out and I hinted at.. debating terminology is kind of a dead end. I personally don't expect that "understanding chess" is the same as "understanding Picasso", or that those phrases would mean the same thing if they were applied to people vs for AI. Also.. I'm also not personally that interested in how performance stacks up compared to humans. Even if it were interesting, the topic of human-equivalent performance would not have static expectations either. For example human-equivalent error rates in AI are much easier for me to expect and forgive in robotics than they are in axiomatic game-play.

> I am interested in what their chess abilities teaches us about how LLMs build world models

Focusing on the single datapoint that TFA is establishing: some LLMs can play some chess with some amount of expertise, with some amount of errors. With no other information at all, this tells us that it failed to model the rules, or it failed in the application of those rules, or both.

Based on that, some questions worth asking: Which one of these failure modes is really acceptable and in which circumstances? Does this failure mode apply to domains other than chess? Does it help if we give it the model directly, say by explaining the rules directly in the prompt and also explicitly stating to not make illegal moves? If it's failing to apply rules, but excels as a model-maker.. then perhaps it can spit out a model directly from examples, and then I can feed the model into a separate engine that makes correct, deterministic steps that actually honor the model?

Saying that LLMs do or don't understand chess is lazy I guess. My basic point is that the questions above and their implications are so huge and sobering that I'm very uncomfortable with premature congratulations and optimism that seems to be in vogue. Chess performance is ultimately irrelevant of course, as you say, but what sits under the concrete question is more abstract but very serious. Obviously it is dangerous to create tools/processes that work "most of the time", especially when we're inevitably going to be giving them tasks where we can't check or confirm "legal moves".