The size is indeed smaller, because text tokens and image tokens are embedded as vectors of the same size, but text tokens typically only cover a few characters, while image tokens typically cover many pixels, so many that you can fit more characters in there. So the same text takes up fewer tokens as an image, and hence requires less time and memory to process.

You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.

▲

jayd16 3 hours ago | parent [-]

So split the difference and start encoding input at the words or phrases level?

	▲	calebkaiser 3 hours ago \| parent [-]
		Lots of researchers have done just this! There's a really rich history of research + lots of contemporary work on different encoding/representation strategies. This might be interesting to you: https://sbert.net/ What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.