Suppose you're right: the internal model of the game rules is perfect, but the application of that model to choosing the next move is imperfect. Unless we can actually separate the two, does it matter? Functionally, I mean, not philosophically. If the model were correct, maybe we could get a useful version of it out by asking the LLM to write a chess engine instead of acting as one. But when the Prolog code for that is as incorrect as the illegal chess move was, will you say again that the model is correct and that its usage merely resulted in minor errors?
> You are saying the LLM is making a model error (rather than an application error) only because of preconceived notions of how 'machines' must behave, not on any rigorous examination.
Here's an anecdotal examination. After much talk about LLMs and chess, math, and formal logic, here's the state of the art, simplified from a dialogue with GPT today:
> blue is red and red is blue. what color is the sky?
>> <blah blah, restates premise, correctly answers "red">
At this point fans rejoice, saying it understands hypotheticals and logic. The dialogue continues...
> name one red thing
>> <blah blah, restates premise, incorrectly offers "strawberries are red">
At this point detractors rejoice and declare that it doesn't understand. Now the conversation devolves into semantics or technicalities about prompt hacks, training data, weights. Whatever. We don't need chess. Just look at it, it's broken as hell. Discussing whether the error is human-equivalent isn't the point either. It's broken! A partially broken process is no solid foundation to build others on. And while there are some exceptions, an unreliable tool/agent is often worse than none at all.