Workaccount2 5 days ago

I think the problem, though, is that they need to store it in text context.

When I am solving a Sokoban-style game, it's entirely visual. I don't need to remember a lot because the visual holds so much information.

It's like the average person trying to play a game of chess with just text. It's nightmarishly hard compared to having a board in front of you. The LLMs seem stuck having to play everything through just text.

modeless 5 days ago | parent [-]

It's not just visual. You also need a representation of the rules of the game and the strategies that make sense. The puzzles I'm solving are not straight Sokoban; they have rules that vary per game and need to be discovered (again, ARC-AGI-3 style), and those rules affect the strategies you need to use. For example, in classic Sokoban you can't push two crates at once, but in some of the puzzles I'm using you can. This is taught by forcing you to do it in the first level, and you need to remember it through the rest of the levels. This is not a purely visual concept, and models still struggle with it.
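
To make the two-crate example concrete, here is a rough sketch of how a per-game push rule changes what a move does. Everything in it is an assumption for illustration: the `try_move` name, the `max_push` parameter, and the text grid encoding (`#` wall, `@` player, `$` crate, space for floor, with the level fully enclosed by walls) are mine, not the actual puzzle implementation.

```python
RIGHT = (0, 1)  # (row delta, column delta); hypothetical direction encoding


def try_move(grid, direction, max_push=1):
    """Return the grid after the player moves, or None if the move is blocked.

    max_push is the hypothetical per-game rule: how many crates in a row
    the player may push at once (1 matches classic Sokoban).
    Assumes the level is fully enclosed by '#' walls.
    """
    dr, dc = direction
    cells = [list(row) for row in grid]

    # Locate the player.
    pr, pc = next(
        (r, c)
        for r, row in enumerate(cells)
        for c, ch in enumerate(row)
        if ch == "@"
    )

    # Walk past the unbroken chain of crates directly in front of the player.
    r, c = pr + dr, pc + dc
    crates = 0
    while cells[r][c] == "$":
        crates += 1
        r, c = r + dr, c + dc

    # The cell after the chain must be free, and the chain must obey the rule.
    if cells[r][c] != " " or crates > max_push:
        return None

    # Shift the chain one step: the far end gains a crate, the player takes
    # the place of the nearest crate (or the empty cell if there was none).
    if crates:
        cells[r][c] = "$"
    cells[pr + dr][pc + dc] = "@"
    cells[pr][pc] = " "
    return ["".join(row) for row in cells]


LEVEL = [
    "######",
    "#@$$ #",
    "######",
]
```

With this encoding, `try_move(LEVEL, RIGHT, max_push=1)` is blocked, while `try_move(LEVEL, RIGHT, max_push=2)` shifts both crates. The picture of the board is identical in both cases; only the rule differs, which is the part that has to be discovered and remembered.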