porridgeraisin 2 days ago

Tokens correspond more to words in text land. The cat-feline etc connection happens when you train the model, not in the tokenisation algorithm, which only sees text, not concepts. Byte pair encoding (BPE) and SentencePiece (the two main tokenisation algorithms used by all LLMs) are mostly leetcode-medium-level algorithms. You can check them out and gain a permanent intuition, especially BPE.
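To make that concrete, here's a toy sketch of the BPE merge loop: repeatedly find the most frequent adjacent pair of tokens and fuse it into one. This starts from characters for readability (real tokenizers start from bytes and train on huge corpora); it's an illustration, not any production tokenizer.

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Toy BPE: greedily merge the most frequent adjacent pair."""
    tokens = list(text)  # start from characters; real BPE starts from bytes
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # rewrite the token stream with the new merged symbol
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 4)
print(tokens)  # after a few merges, "low" becomes a single token
```

Note how "low" ends up as one token purely because the character sequence is frequent: nothing in the algorithm knows what "low" means, which is why the semantic connections have to come from training the model itself.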

For images, you take patches of the image (say 16x16 patches[1]) and pass them directly into the FFN+transformer machinery[2]. As such, there is no vocabulary of tokens for images[3]. So the billing happens per image patch, i.e. for larger images your cost goes up[2], since there are more px*py patches.
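The per-patch arithmetic is just a grid count. A rough sketch (assuming 16x16 patches and that images get padded up to a multiple of the patch size; real pipelines also resize and add special tokens):

```python
import math

def num_patches(width, height, patch=16):
    """Patch count for a ViT-style encoder; ceil models padding to the grid."""
    return math.ceil(width / patch) * math.ceil(height / patch)

print(num_patches(512, 512))    # 32 * 32 = 1024 patches
print(num_patches(1024, 1024))  # 64 * 64 = 4096 patches, ~4x the cost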

[1] x 3, due to RGB

[2] Up to a point; beyond a certain resolution the image gets downsampled to lower quality. The downsampling happens in many ways... Qwen-VL uses a CNN; GPT, IIRC, stuffs a downsampler after the embedding layer as well as before. Anyway, in all these cases they usually take the average reduction from that downsampler and cut your billed tokens by that much. OpenAI's bin-based billing works like this.
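As a hypothetical sketch of that "average reduction" billing (the factor 4 here is an assumed example, e.g. a downsampler merging each 2x2 group of patch embeddings, not any vendor's real number):

```python
def billed_tokens(raw_patches, downsample_factor=4):
    # downsample_factor is an illustrative assumption, not a real pricing constant
    return max(1, raw_patches // downsample_factor)

print(billed_tokens(4096))  # 4096 raw patches billed as 1024 tokens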

[3] DALL-E from way back when did have a discrete set of image tokens: it mapped every patch of every image in the world to one entry from that fixed vocabulary, IIRC.