joegibbs | 6 hours ago
I remember that a couple of years ago people were talking about how multimodal models would have skills bleed over, so one trained on the same amount of text plus a ton of video/image data would perform better on text responses. Did this end up holding up? Intuitively I would think that text packs much more meaning into the same amount of data than visuals do (a single 1000x1000px image is roughly the same amount of raw data as a million characters), which would hamstring it.
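A rough back-of-envelope sketch of that size comparison, assuming an uncompressed 3-bytes-per-pixel RGB image and roughly 1 byte per character of text (ignoring compression and how models actually tokenize either modality):

```python
# Raw data size of a 1000x1000 image vs. a million characters of text.
# Assumptions: 3 bytes/pixel (uncompressed RGB), ~1 byte/char (ASCII/UTF-8).
image_bytes = 1000 * 1000 * 3   # 1,000,000 pixels -> 3,000,000 bytes
text_bytes = 1_000_000 * 1      # 1,000,000 chars  -> 1,000,000 bytes
print(image_bytes, text_bytes, image_bytes / text_bytes)
```

So in raw bytes the image is only a few times larger than a million characters, even though the text plausibly carries far more semantic content.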