joegibbs | 6 hours ago
I remember that a couple of years ago people were talking about how multimodal models would have skills bleed over, so one trained on the same amount of text plus a ton of video/image data would perform better on text responses. Did this end up holding up? Intuitively I would think that text packs much more meaning into the same amount of data than visuals do (a single 1000x1000px image is roughly the same amount of raw data as a million characters), which would hamstring it.
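A rough back-of-envelope sketch of that size comparison, assuming an uncompressed 3-bytes-per-pixel RGB image and roughly 1 byte per character of text (ignoring compression and how models actually tokenize either modality):

```python
# Raw data size of a 1000x1000 image vs. a million characters of text.
# Assumptions: 3 bytes/pixel (uncompressed RGB), ~1 byte/char (ASCII/UTF-8).
image_bytes = 1000 * 1000 * 3   # 1,000,000 pixels -> 3,000,000 bytes
text_bytes = 1_000_000 * 1      # 1,000,000 chars  -> 1,000,000 bytes
print(image_bytes, text_bytes, image_bytes / text_bytes)
```

So in raw bytes the image is only a few times larger than a million characters, even though the text plausibly carries far more semantic content.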