| ▲ | modeless 6 days ago |
| I've been testing LLMs on Sokoban-like puzzles (in the style of ARC-AGI-3) and they are completely awful at them. It really highlights how poor their memory is. They can't remember abstract concepts or rules between steps, even if they discover them themselves. They can only be presented with lossy text descriptions of such things which they have to re-read and re-interpret at every step. LLMs are completely helpless on agentic tasks without a ton of scaffolding. But the scaffolding is inflexible and brittle, unlike the models themselves. Whoever figures out how to reproduce the functions of this type of scaffolding within the models, with some kind of internal test-time-learned memory mechanism, is going to win. |
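| A minimal sketch of the kind of scaffolding pattern being described, assuming hypothetical `query_model` and `env` stand-ins rather than any real API: the rules the model discovers live outside the model as plain text and have to be re-serialized into the prompt and re-interpreted at every single step. |

```python
# Minimal sketch of external "memory" scaffolding for a puzzle agent.
# `query_model` and the `env` interface are hypothetical stand-ins.

def run_agent(env, query_model, max_steps=200):
    learned_rules = []  # abstract rules the model has discovered, kept only as lossy text
    for step in range(max_steps):
        prompt = (
            "Rules discovered so far:\n" + "\n".join(learned_rules) +
            "\n\nCurrent board:\n" + env.render_text() +
            "\n\nReply with MOVE:<up|down|left|right> and, if you learned "
            "something new, RULE:<one-sentence rule>."
        )
        reply = query_model(prompt)  # the model must re-read and re-interpret everything
        for line in reply.splitlines():
            if line.startswith("RULE:"):
                learned_rules.append(line[len("RULE:"):].strip())
            elif line.startswith("MOVE:"):
                env.step(line[len("MOVE:"):].strip())
        if env.solved():
            return step + 1  # number of moves taken
    return None  # gave up
```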
|
| ▲ | sunrunner 5 days ago | parent | next [-] |
| I'm not sure how similar this is, but I tried the same thing quite a while back with a simple 5x5 nonogram (Picross) and ran into similar difficulties. I found not only incorrect 'reasoning', but also that even after I was explicit about why a certain deduction was wrong, the same incorrect deduction would appear again later, and this happened over and over. Also, there's already a complete database of valid answers at [1], so I'm not sure why the correct answer couldn't just come from that, with the 'reasoning' being 'We solved this here, look...' ;) [1] The wonderful https://pixelogic.app/every-5x5-nonogram |
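| For scale, a brute-force sketch (not the linked site's implementation, just an illustration of why a complete 5x5 answer database is feasible): there are only 2^25, about 33.5 million, possible 5x5 grids, so every clue set can be precomputed and looked up instead of 'reasoned' about. |

```python
# Sketch: why a complete 5x5 nonogram answer database is feasible.
# Every possible grid fits in 25 bits, so the whole space is enumerable.
from itertools import groupby

def clues(line):
    """Run-lengths of filled cells in one row/column, e.g. [1,1,0,1,1] -> (2, 2)."""
    return tuple(sum(g) for k, g in groupby(line) if k) or (0,)

def grid_clues(bits):
    """Row and column clues for the 5x5 grid encoded in the low 25 bits of `bits`."""
    rows = [[(bits >> (r * 5 + c)) & 1 for c in range(5)] for r in range(5)]
    cols = [list(col) for col in zip(*rows)]
    return tuple(clues(r) for r in rows), tuple(clues(c) for c in cols)

# Building the full lookup table (clue set -> matching grids) is slow in pure
# Python but entirely tractable, since there are only 2**25 grids:
# table = {}
# for bits in range(2 ** 25):
#     table.setdefault(grid_clues(bits), []).append(bits)
```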
| |
| ▲ | Akronymus 5 days ago | parent [-] | | > I found not only incorrect 'reasoning', but also that even after I was explicit about why a certain deduction was wrong, the same incorrect deduction would appear again later, and this happened over and over. Because it's in the context window, and a lot of training material refers back to earlier material for later points, the model is trained to bring that stuff up again and again, even if it's in the window as a negative. |
|
|
| ▲ | Workaccount2 5 days ago | parent | prev | next [-] |
| I really think the problem is with tokenizing vision. Ask them to do any kind of visually based reasoning and they become dumb as rocks. It feels similar to having a person play Sokoban blindfolded, with only text prompts. The same issue cropped up with playing Pokémon: the image gets translated to text, and then the model works on that. I'm no expert on transformers, but it feels like there's some kind of limit that prevents the models from "thinking" visually. |
| |
| ▲ | modeless 5 days ago | parent [-] | | Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns. Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context is bad for abstract concepts and rules vs. keyword matching, and to me the reason for that is that text is lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures. | | |
| ▲ | Workaccount2 5 days ago | parent [-] | | I think the problem, though, is that they need to store it in text context. When I'm solving a Sokoban-style game, it's entirely visual. I don't need to remember much because the visual holds so much information. It's like the average person trying to play a game of chess through text alone: nightmarishly hard compared to having a board in front of you. The LLMs seem stuck having to play everything through just text. | | |
| ▲ | modeless 5 days ago | parent [-] | | It's not just visual. You also need a representation of the rules of the game and the strategies that make sense. The puzzles I'm solving are not straight Sokoban; they have per-game varying rules that need to be discovered (again, ARC-AGI-3 style) and that affect the strategies you need to use. For example, in classic Sokoban you can't push two crates at once, but in some of the puzzles I'm using you can. This is taught by forcing you to do it in the first level, and you need to remember it through the rest of the levels. It's not a purely visual concept, and models still struggle with it. |
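| A sketch of how that kind of per-game rule variation might be encoded, with a made-up `max_chain` parameter: classic Sokoban is `max_chain=1`, and the variant described is just a different parameter value, which is exactly the sort of fact the model has to carry between levels. |

```python
# Sketch of a parameterized push rule. Positions are (row, col) tuples,
# `walls` and `crates` are sets; `max_chain` is an illustrative parameter.

def try_push(walls, crates, player, direction, max_chain=1):
    """Return updated (player, crates) if the move is legal, else None."""
    dr, dc = direction
    line = []                                  # consecutive crates in front of the player
    r, c = player[0] + dr, player[1] + dc
    while (r, c) in crates:
        line.append((r, c))
        r, c = r + dr, c + dc
    if len(line) > max_chain or (r, c) in walls:
        return None                            # too many crates to push, or blocked by a wall
    new_crates = (crates - set(line)) | {(cr + dr, cc + dc) for cr, cc in line}
    return (player[0] + dr, player[1] + dc), new_crates
```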
|
|
|
|
| ▲ | M4v3R 5 days ago | parent | prev | next [-] |
| I wonder if scaffolding synthesis is the way to go. Namely, the LLM itself first reasons about the problem and creates scaffolding for a second agent that does the actual solving, all inside a feedback loop that adjusts the scaffolding based on results. |
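| A rough sketch of that loop, with a hypothetical `llm` call and `run_solver` agent: the first pass writes the scaffolding, the second agent solves under it, and failures feed back into a revision pass. |

```python
# Rough sketch of scaffolding synthesis: one model writes the scaffold
# (rules, invariants, sub-goals), a second agent solves under it, and
# failures are fed back to revise the scaffold. `llm`, `run_solver`, and
# the `result` shape are hypothetical stand-ins.

def synthesize_and_solve(problem, llm, run_solver, max_rounds=5):
    scaffold = llm(
        "Analyze this problem and write step-by-step scaffolding "
        f"(rules, invariants, sub-goals) for a solver agent:\n{problem}"
    )
    for _ in range(max_rounds):
        result = run_solver(problem, scaffold)  # second agent does the actual solving
        if result.success:
            return result
        scaffold = llm(
            f"The solver failed with:\n{result.feedback}\n"
            f"Revise the scaffolding accordingly:\n{scaffold}"
        )
    return None  # scaffolding never converged
```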
| |
| ▲ | modeless 5 days ago | parent | next [-] | | In general I think the more of the scaffolding that can be folded into the model, the better. The model should learn problem solving strategies like this and be able to manage them internally. | |
| ▲ | sixo 5 days ago | parent | prev | next [-] | | I toyed around with the idea of using an LLM to "compile" user instructions into a kind of AST of scaffolding, which can then be run by another LLM. It worked fairly well for the kind of semi-structured tasks LLMs choke on, like "for each of 100 things, do...", but I haven't taken it beyond a minimal impl. | | |
| ▲ | harshitaneja 5 days ago | parent [-] | | I am working on something similar but with an AST for legal documents. So far, it seems promising but still rudimentary. |
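| A rough sketch of the "compile instructions into an AST of scaffolding" idea above, with made-up node types and a hypothetical `llm` call: a planner model emits a structure like this, and a runner executes it, calling the model only at the leaves, so the "for each of 100 things" bookkeeping never depends on the model's memory. |

```python
# Rough sketch of an "AST of scaffolding": the node shapes and `llm`
# call are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Task:            # leaf: one concrete prompt for the model
    prompt: str

@dataclass
class ForEach:         # "for each of 100 things, do ..." handled by the runner, not the model
    items: List[str]
    body: "Node"

@dataclass
class Sequence:        # ordered steps
    steps: List["Node"]

Node = Union[Task, ForEach, Sequence]

def run(node: Node, llm, context: str = "") -> List[str]:
    if isinstance(node, Task):
        return [llm(context + node.prompt)]
    if isinstance(node, ForEach):
        out = []
        for item in node.items:
            out += run(node.body, llm, context + f"Current item: {item}\n")
        return out
    if isinstance(node, Sequence):
        return [r for step in node.steps for r in run(step, llm, context)]
    raise TypeError(node)
```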
| |
| ▲ | plantain 5 days ago | parent | prev [-] | | If you've ever used Claude Code + Plan mode, you know that this is exactly what happens. |
|
|
| ▲ | low_tech_love 5 days ago | parent | prev [-] |
| Try to get your LLM of choice to find its way out of a labyrinth that you describe in text form. It's absolutely awful even with the simplest mazes. I'm not sure the problem here is memory, though? I think it has to do with spatial reasoning. I'd be willing to bet every company right now is working on spatial reasoning (at least up to 3D), and as soon as that works, a huge number of pieces will fall into place. |
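| A sketch of that kind of test harness, with a made-up maze and move format: the model gets the maze as text and proposes a move string, which is checked mechanically, while BFS supplies the ground-truth shortest path for comparison. |

```python
# Sketch of a text-maze test: the maze layout and the U/D/L/R move
# format are made up for illustration; BFS gives the ground truth.
from collections import deque

MAZE = ["#######",
        "#S..#.#",
        "#.#.#.#",
        "#.#...#",
        "#...#E#",
        "#######"]

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def find(ch):
    return next((r, c) for r, row in enumerate(MAZE) for c, x in enumerate(row) if x == ch)

def check_path(moves: str) -> bool:
    """Replay the model's proposed move string and verify it reaches the exit."""
    r, c = find("S")
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if MAZE[r][c] == "#":
            return False                       # walked into a wall
    return (r, c) == find("E")

def shortest_path_len() -> int:
    """Ground-truth shortest path length via breadth-first search."""
    start, end = find("S"), find("E")
    seen, q = {start}, deque([(start, 0)])
    while q:
        (r, c), d = q.popleft()
        if (r, c) == end:
            return d
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if MAZE[nxt[0]][nxt[1]] != "#" and nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return -1
```

| Here `check_path("RRDDRRD")` returns True and `shortest_path_len()` is 7, so a model's answer can be scored automatically without any human judging. |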
| |
| ▲ | modeless 5 days ago | parent [-] | | Spatial reasoning is weak, but I still frequently see models come up with the right answer in their reasoning steps, only to make the wrong move in the following turn because they forget what they just learned. For models with hidden reasoning it's often not even possible to retain the reasoning tokens in context through multiple steps, and even if you could, context windows are big but not big enough to hold the full reasoning traces for hundreds of steps. And even if they were, retrieval from context for abstract concepts (vs. verbatim copying) is terrible. Text is too lossy and inefficient. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures. |
|