dragonwriter 2 days ago

> Why: LLMs are increasingly becoming multimodal, so an image "token" or video "token" is not as simple as a text token.

For autoregressive token-based multimodal models, image tokens are as straightforward as text tokens, and there is no reason video tokens wouldn’t also be. (If models also switch architecture and multimodal diffusion models, say, become more common, then, sure, a different pricing model more tied to the actual compute cost drivers for that architecture is likely, but even that isn’t likely to be bytes.)

> Also, it's difficult to compare across competitors because tokenization is different.

That’s a reason for incumbents to prefer not to switch, though, not a reason for them to switch.

> Eventually prices will just be in $/Mb of data processed.

More likely they would be in floating point operations expended processing them, but using tokens (which are the primary cost drivers for the current LLM architectures) will probably continue as long as the architecture itself is dominant.

oblio 2 days ago | parent | next [-]

> For autoregressive token-based multimodal models, image tokens are as straightforward as text tokens, and there is no reason video tokens wouldn’t also be.

In classical computing, there is a clear hierarchy: text < images <<< video.

Is there a reason why video computing using LLMs shouldn't be much more intensive and therefore costly than text or image output?

Filligree 2 days ago | parent | next [-]

Of course it’s more expensive. It’s still tokens, but considerably more of them.

oblio 2 days ago | parent [-]

That's the thing, I can't visualize (and I don't think most people can) what "tokens" represent for image or video outputs.

For text I just assume them to be word stems or more like word-family members (cat-feline-etc).

For images and videos I guess each character, creature, idea in it is a token? Blue sky, cat walking around, gentleman with a top hat, multiplied by the number of frames?

dragonwriter 2 days ago | parent | next [-]

> For images and videos I guess each character, creature, idea in it is a token?

No, for images, tokens would, I expect, usually be asymptotically proportional to the area of the image (this is certainly the case with input tokens for OpenAI's models that take image inputs; outputs are more opaque). You probably won’t have a neat one-to-one intuition for what one token represents, but you don’t need that for it to be useful and straightforward for understanding pricing, since the mathematical relationship of tokens to size can be published and the size of the image is a known quantity. (And videos conceptually could be like images with an additional dimension.)
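To make "proportional to area" concrete, here is a sketch modeled on the tile-based scheme OpenAI has documented for image inputs: a flat base cost plus a fixed token cost per 512px tile. The constants are illustrative, and the real scheme also rescales large images first.

```python
import math

def vision_input_tokens(width, height, tile=512, per_tile=170, base=85):
    """Area-proportional token count: base cost plus a fixed cost
    per tile. Constants mirror one published scheme but are
    illustrative here; real billing also downscales large images."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles

# Doubling each dimension roughly quadruples the cost:
vision_input_tokens(512, 512)    # 1 tile  -> 255 tokens
vision_input_tokens(1024, 1024)  # 4 tiles -> 765 tokens
```

The point is that the formula is publishable and the image size is known, so pricing stays predictable even without a per-token intuition.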

porridgeraisin 2 days ago | parent | prev [-]

Tokens correspond more to words in text land. The cat-feline etc connection happens when you train the model and not really by the tokenisation algorithm, which only sees text and not concepts. Byte pair encoding and SentencePiece (the two main tokenisation algorithms used by all LLMs) are mostly leetcode-medium-level algorithms. You can check it out and gain a permanent intuition, especially BPE.
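For the BPE intuition mentioned above, here is a minimal training loop: repeatedly merge the most frequent adjacent symbol pair. Real tokenizers (tiktoken, SentencePiece) add byte fallback, regex pre-splitting, and corpus frequency weighting, but the core idea is just this.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Tiny BPE sketch: start from characters (real BPE starts from
    bytes) and repeatedly merge the most frequent adjacent pair."""
    corpus = [tuple(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(tuple(out))
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["cat", "cats", "catalog"], num_merges=2)
# The first two merges are ("c","a") then ("ca","t"), so "cat"
# becomes a single symbol purely from frequency, no semantics.
```

Note the algorithm never sees meaning, only co-occurrence, which is exactly the point: the cat-feline association comes from training the model, not the tokenizer.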

For images, you take patches of the image (say 16x16 patches[1]), and then pass them directly into the FFN+transformer machinery[2]. As such, there is no vocabulary of tokens for images[3]. So, the billing happens per image patch, i.e., for large images your cost will go up[2], since there will be more px*py patches.

[1] x 3, due to RGB

[2] Up to a point; it gets downsampled to lower quality beyond a certain resolution. The downsampling happens in many ways... Qwen-VL uses a CNN, GPT iirc stuffs a downsampler after the embedding layer as well as before. Anyway, they usually just take the average reduction from that downsampler and cut your billed tokens by that much in all these cases. OpenAI's bin-based billing is like this.

[3] Dall-E from way back when did have a discrete set of tokens and it mapped all patches of all images in the world to one from that, IIRC.
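The per-patch arithmetic above can be sketched in a few lines; the 16px patch size is the classic ViT choice, and the cap stands in for the downsampling mentioned in footnote [2] (both constants are illustrative).

```python
import math

def patch_count(width, height, patch=16, max_patches=None):
    """ViT-style patch arithmetic: the image is split into
    patch x patch squares, each becoming one 'token'. max_patches
    models a downsampling cap (hypothetical constant)."""
    n = math.ceil(width / patch) * math.ceil(height / patch)
    if max_patches is not None:
        n = min(n, max_patches)
    return n

patch_count(224, 224)                    # 14 * 14 = 196 patches
patch_count(448, 448)                    # 4x the area -> 784 patches
patch_count(4096, 4096, max_patches=1024)  # capped, as in footnote [2]
```

So billing scales with image area up to the cap, even though no single patch maps to a nameable concept like "cat" or "top hat".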

dragonwriter 2 days ago | parent | prev [-]

No, it'll certainly be more expensive in any conceivable model that handles all three modalities, but if the model uses an architecture like current autoregressive, token-based multimodal LLMs/VLMs, tokens will make just as much sense as the basis for pricing, and be similarly straightforward, as with text and images.

efskap 2 days ago | parent | prev [-]

To clarify, "as straightforward" = same dimensionality? I guess it would have to be, to be usable in the same embedding space.