Remix.run Logo
stingraycharles an hour ago

> I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually.

Images are tokenized and fed to the exact same model, they can “visually inspect” images, eg “find the 2 differences between two images” and “where’s Waldo”-style things.

So your mental model that they see descriptions is inaccurate.