Workaccount2 | 5 days ago
I really think the problem is with tokenizing vision. Any kind of visually based reasoning and they become dumb as rocks. It feels like having a person play Sokoban blindfolded, with only text prompts. The same issue cropped up with playing Pokémon: the image gets translated to text, and then the model works on that. I'm no expert on transformers, but it just feels like there is some kind of limit that prevents the models from "thinking" visually.
modeless | 5 days ago | parent
Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns. Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context works well for keyword matching but poorly for abstract concepts and rules, and to me the reason for that is that text is a lossy and inefficient representation. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
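A rough sketch of the memory failure described above, assuming a generic chat-style agent loop (the function and message names are illustrative, not any particular vendor's API): each turn only the short final answer is written back into the conversation, so the reasoning trace that produced it is gone by the next turn.

    # Hypothetical agent loop illustrating why per-turn reasoning gets "forgotten".
    # call_model is a stand-in for a real chat-completion call that returns both a
    # reasoning trace and a short visible answer.

    def call_model(messages):
        # Placeholder: a real implementation would send `messages` to a model endpoint.
        reasoning = "If the crate is pushed right it blocks the goal, so push it up first..."
        answer = "Push the crate up."
        return reasoning, answer

    conversation = [{"role": "system", "content": "Play the puzzle step by step."}]

    for turn in range(3):
        reasoning, answer = call_model(conversation)
        # Only the short answer survives into the next turn; the reasoning trace is
        # dropped, either because the API never returns it or to save context space.
        conversation.append({"role": "assistant", "content": answer})
        conversation.append({"role": "user", "content": f"New game state after turn {turn}: ..."})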