Workaccount2 | 2 days ago
I'd take text-to-image capabilities with a grain of salt, because they are dramatically weaker than the text-to-text abilities. I don't know the exact mechanics of current multimodal models, but it is pretty clear that there is a disconnect between what the text model wants and what the image generation side outputs. It almost feels like asking someone with a blindfold on to draw a cat: you get a mess.

If you ask ChatGPT to describe a new image based on an input image, it does dramatically better. But ask it to use its image generation tooling, and the "awareness" judged by the image output falls off a cliff.

Another example is infographics or flow charts. The models can easily output that information and put it in a nicely formatted text grid for you. But ask them to put it in a visual image, and it's just a mess. I don't think it's the models; I think it's the text-to-image translation layer.
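For anyone who wants to try the describe-vs-generate comparison themselves, here is a minimal sketch using the OpenAI Python SDK. This is not the setup described above; the model names, prompt, and image URL are placeholders.

  # Sketch: compare image *understanding* (chat with an image attached)
  # against image *generation* for the same request.
  from openai import OpenAI

  client = OpenAI()
  image_url = "https://example.com/cat.jpg"  # placeholder reference image

  # 1) Understanding path: ask the model to describe a new image based on the input.
  described = client.chat.completions.create(
      model="gpt-4o",
      messages=[{
          "role": "user",
          "content": [
              {"type": "text",
               "text": "Describe a new image based on this one, with the scene moved to a beach."},
              {"type": "image_url", "image_url": {"url": image_url}},
          ],
      }],
  )
  print(described.choices[0].message.content)  # typically coherent and detailed

  # 2) Generation path: hand a similar request to the image tool.
  generated = client.images.generate(
      model="dall-e-3",
      prompt="The same cat as in the reference photo, now sitting on a beach.",
      size="1024x1024",
  )
  print(generated.data[0].url)  # often drops details the text side handled fine

The interesting part is comparing how much of the reference image survives each path, which is roughly the "awareness falls off a cliff" effect described above.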
weinzierl | 2 days ago
This is a good point. The 2D bird's-eye view image adds a separate complication. There are certainly better and more direct ways to show that current models are bad at spatial reasoning; this was just a byproduct of my geolocation experiments. Maybe I will give it a shot another day.