oblio 2 days ago

That's the thing: I can't visualize (and I don't think most people can) what "tokens" represent for image or video outputs.

For text I just assume them to be word stems, or more like word-family members (cat, feline, etc.).

For images and videos I guess each character, creature, or idea in it is a token? Blue sky, cat walking around, gentleman with a top hat, multiplied by the number of frames?

dragonwriter 2 days ago

> For images and videos I guess each character, creature, or idea in it is a token?

No. For images, I expect tokens would usually be asymptotically proportional to the area of the image (this is certainly the case with input tokens for OpenAI's models that take image inputs; outputs are more opaque). You probably won't have a neat one-to-one intuition for what one token represents, but you don't need that for pricing to be useful and straightforward to understand: the mathematical relationship of tokens to size can be published, and the size of the image is a known quantity. (And videos could conceptually be treated as images with an additional dimension.)
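
To make that concrete, here is a minimal sketch of what such a published relationship could look like, assuming a hypothetical one-token-per-16x16-pixel-patch scheme (the patch size and function name are illustrative, not any provider's actual formula):

    import math

    def estimate_image_tokens(width_px, height_px, patch_px=16):
        # Hypothetical rule: one token per patch_px x patch_px tile,
        # so the token count scales with image area.
        return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)

    print(estimate_image_tokens(512, 512))    # 1024
    print(estimate_image_tokens(1024, 1024))  # 4096: 2x each side => 4x tokens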

porridgeraisin 2 days ago

Tokens correspond more closely to words in text land. The cat-feline connection happens when you train the model, not in the tokenisation algorithm, which only sees text and not concepts. Byte pair encoding and SentencePiece (the two main tokenisation algorithms used by essentially all LLMs) are mostly leetcode-medium-level algorithms. You can read through them and gain a permanent intuition, especially BPE.
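
To get that intuition, here is a toy BPE trainer in Python (a sketch only; real tokenisers train on byte sequences with word-frequency counts and end-of-word handling, all omitted here):

    from collections import Counter

    def learn_bpe_merges(corpus, num_merges):
        # Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
        words = [list(w) for w in corpus]  # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append(best)
            merged = best[0] + best[1]
            # Apply the new merge everywhere in the corpus.
            new_words = []
            for w in words:
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(merged)
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words.append(out)
            words = new_words
        return merges

    print(learn_bpe_merges(["cat", "cats", "catalog"], 2))
    # [('c', 'a'), ('ca', 't')]: "cat" has become a single symbol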

For images, you take patches of the image (say, 16x16-pixel patches[1]) and pass them directly into the FFN+transformer machinery[2]. As such, there is no vocabulary of tokens for images[3], so the billing happens per image patch. I.e., for larger images your cost goes up[2], since they contain more px*py patches.
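
A minimal sketch of that patching step in numpy (the 16-pixel patch size and 224x224 input are illustrative, loosely ViT-style, not any particular model's numbers):

    import numpy as np

    def image_to_patches(img, patch=16):
        # img: (H, W, 3) array, H and W divisible by `patch`.
        # Returns (num_patches, patch*patch*3): one flattened vector per "token".
        h, w, c = img.shape
        grid = img.reshape(h // patch, patch, w // patch, patch, c)
        grid = grid.transpose(0, 2, 1, 3, 4)  # gather the two grid axes first
        return grid.reshape(-1, patch * patch * c)

    img = np.zeros((224, 224, 3), dtype=np.float32)
    print(image_to_patches(img).shape)  # (196, 768): a 14x14 grid, billed per patch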

[1] x3 channels, due to RGB, so each 16x16 patch is really 16x16x3 values.

[2] Up to a point; beyond a certain resolution the image gets downsampled to lower quality. The downsampling happens in many ways: Qwen-VL uses a CNN; GPT, IIRC, stuffs a downsampler after the embedding layer as well as before. Anyway, they usually just take the average reduction from that downsampler and cut your billed tokens by that much. OpenAI's bin-based billing is like this.
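
As a sketch of why such a reduction is easy to bill for: a 2x2 average pool over the patch grid (one possible downsampler, not any vendor's actual one) cuts the token count by exactly 4x:

    import numpy as np

    grid = np.random.rand(14, 14, 768)               # 196 patch embeddings
    pooled = grid.reshape(7, 2, 7, 2, 768).mean(axis=(1, 3))
    print(grid.shape[0] * grid.shape[1])             # 196 tokens before pooling
    print(pooled.shape[0] * pooled.shape[1])         # 49 tokens after: 4x cheaper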

[3] DALL-E from way back when did have a discrete set of tokens: it mapped every patch of every image to one entry from that set, IIRC.
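
A sketch of that discrete lookup (a VQ-style codebook; the sizes here are toy values, though IIRC the original DALL-E's visual vocabulary had 8192 entries):

    import numpy as np

    codebook = np.random.rand(512, 64)    # 512 "visual words" (toy size)
    patch_vecs = np.random.rand(9, 64)    # one vector per patch (3x3 toy grid)
    # Each patch becomes the index of its nearest codebook entry.
    dists = ((patch_vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    token_ids = dists.argmin(axis=1)      # shape (9,): discrete image tokens
    print(token_ids)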