modeless | 5 days ago
Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns. Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context works far better for keyword matching than for abstract concepts and rules, and to me the underlying reason is that text is a lossy, inefficient medium. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
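To make the "forgetting between turns" part concrete, here's a rough sketch of the kind of text-based memory scaffolding people bolt on as a workaround: carry a compact store of derived facts between turns instead of the full reasoning trace. This is purely illustrative Python; call_model() is a hypothetical placeholder, not any real API.

    import json

    def call_model(prompt: str) -> str:
        # Hypothetical stand-in for an LLM call; not any specific API.
        raise NotImplementedError

    def play_turn(observation: str, memory: dict) -> tuple[str, dict]:
        # Each turn the model sees the current observation plus a compact
        # store of previously derived facts, not the full past transcript.
        prompt = (
            "Known facts and rules from earlier turns:\n"
            + json.dumps(memory, indent=2)
            + "\n\nCurrent observation:\n" + observation
            + "\n\nReply with JSON: {\"action\": ..., \"new_facts\": {...}}"
        )
        reply = json.loads(call_model(prompt))
        memory.update(reply.get("new_facts", {}))  # keep conclusions, drop the trace
        return reply["action"], memory

Of course that compact store is still text, which is exactly the limitation I'm pointing at.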
Workaccount2 | 5 days ago | parent
The problem, though, is that they have to store it as text in context. When I'm solving a Sokoban-style game, the process is entirely visual; I don't need to remember much because the picture itself holds so much information. It's like the average person trying to play chess through text alone: nightmarishly hard compared to having a board in front of you. The LLMs seem stuck playing everything through just text.
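To illustrate what "playing through just text" means, here's a toy Sokoban state rendered as characters (illustrative Python only): every spatial relation you would simply see on a board has to be re-derived by scanning the text.

    LEVEL = [
        "#######",
        "#.  @ #",   # '.' goal, '@' player
        "# $   #",   # '$' box
        "#######",
    ]

    def find(symbol):
        # Locate every occurrence of a symbol as (row, col) pairs.
        return [(r, c)
                for r, row in enumerate(LEVEL)
                for c, ch in enumerate(row)
                if ch == symbol]

    print("player:", find("@"))  # [(1, 4)]
    print("boxes: ", find("$"))  # [(2, 2)]
    print("goals: ", find("."))  # [(1, 1)]

A human glances at the board and knows the box is below and to the left of the player; a text-bound model has to reconstruct that from coordinates on every single turn.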