Workaccount2 | 5 days ago
I really think the problem is with tokenizing vision. Any kind of visually based reasoning and they become dumb as rocks. It feels like having a person play Sokoban blindfolded, with only text prompts. The same issue cropped up with playing Pokémon: the image gets translated to text, and then the model works on that. I'm no expert on transformers, but it just feels like there is some kind of limit that prevents the models from "thinking" visually.
modeless | 5 days ago | parent
Yes, vision is a problem, but I don't think it's the biggest problem for the specific task I'm testing. The memory problem is bigger. The models frequently do come up with the right answer, but they promptly forget it between turns. Sometimes they forget because the full reasoning trace is not preserved in context (either due to API limitations or simply because the context isn't big enough to hold dozens or hundreds of steps of full reasoning traces). Sometimes it's because retrieval from context works well for keyword matching but poorly for abstract concepts and rules, and to me the reason for that is that text is a lossy and inefficient representation. The models need to be able to internally store and retrieve a more compact, abstract, non-verbal representation of facts and procedures.
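A rough sketch of the memory failure described above, assuming a generic chat-style agent loop (the function and message names are illustrative, not any particular vendor's API): each turn only the short final answer is written back into the conversation, so the reasoning trace that produced it is gone by the next turn.

    # Hypothetical agent loop illustrating why per-turn reasoning gets "forgotten".
    # call_model is a stand-in for a real chat-completion call that returns both a
    # reasoning trace and a short visible answer.

    def call_model(messages):
        # Placeholder: a real implementation would send `messages` to a model endpoint.
        reasoning = "If the crate is pushed right it blocks the goal, so push it up first..."
        answer = "Push the crate up."
        return reasoning, answer

    conversation = [{"role": "system", "content": "Play the puzzle step by step."}]

    for turn in range(3):
        reasoning, answer = call_model(conversation)
        # Only the short answer survives into the next turn; the reasoning trace is
        # dropped, either because the API never returns it or to save context space.
        conversation.append({"role": "assistant", "content": answer})
        conversation.append({"role": "user", "content": f"New game state after turn {turn}: ..."})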