LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead!

Whether such lossy compression is acceptable for your use case is up to you.

Taek 3 hours ago | parent | next [-]

I don't think it's that bad. If I recall correctly, it's about 8 kilobytes per token, and a token is typically 3-4 characters, so you're talking about ~2 kilobytes per character.

An image token, as I recall, covers something like a 16x16 pixel patch, so you get 32 bytes of overhead per pixel. A character needs at minimum something like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12.

So roughly 3x savings (about two-thirds fewer tokens)... which actually maps pretty closely to the 60% savings figure.
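
For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. Every constant is one of the recalled, approximate figures from this comment (8 kB of KV cache per token, 4 characters per text token, a 16x16 pixel image token, 20 pixels per character), not a measurement of any particular model, and the variable names are made up for illustration.

```python
# Back-of-envelope check; all constants are rough recalled figures, not measurements.
KV_BYTES_PER_TOKEN = 8 * 1024   # "about 8 kilobytes per token"
CHARS_PER_TEXT_TOKEN = 4        # upper end of "3-4 characters"
PATCH_SIDE = 16                 # assumed 16x16 pixel patch per image token
PIXELS_PER_CHAR = 20            # "minimally like 20 pixels" per rendered character

# Text path: KV cache bytes spent per character.
text_bytes_per_char = KV_BYTES_PER_TOKEN / CHARS_PER_TEXT_TOKEN

# Image path: KV cache bytes per pixel, and characters that fit in one image token.
pixels_per_image_token = PATCH_SIDE * PATCH_SIDE
bytes_per_pixel = KV_BYTES_PER_TOKEN / pixels_per_image_token
chars_per_image_token = pixels_per_image_token / PIXELS_PER_CHAR

# How many times fewer tokens the image path needs, and the fraction saved.
savings_factor = chars_per_image_token / CHARS_PER_TEXT_TOKEN
fraction_saved = 1 - 1 / savings_factor

print(f"text: {text_bytes_per_char:.0f} bytes of KV cache per character")
print(f"image: {bytes_per_pixel:.0f} bytes per pixel, "
      f"{chars_per_image_token:.1f} characters per token")
print(f"savings: {savings_factor:.1f}x, i.e. {fraction_saved:.0%} fewer tokens")
```

With these inputs the factor comes out a bit above 3x. Changing `PIXELS_PER_CHAR` moves the result a lot, so treat it as a sanity check rather than a prediction.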

esafak 2 hours ago | parent | prev [-]

Can we get some refs for this number? If true it sounds like poor design.

	Tuna-Fish 2 hours ago | parent [-]
		It's not quite as bad as the parent made it out to be; the largest I've seen is 32 kB per token (where a token sometimes represents a single byte, but usually represents more than one). It's forced by the nature of how LLMs use vector embeddings for language.

		Basically, a single token in an LLM is represented as an n-element vector, where n is the "hidden dimension", also known as the model dimension. In order for the model to be smart, the hidden dimension needs to be large, on the order of 2^16 on top-tier models. Elements of this vector are typically quantized to 2-byte floats, or sometimes smaller.

		Every possible fact is embedded as a direction in this very high-dimensional vector space, and a token is related to a fact if the vector representing that token points in a similar direction to that fact. You can do vector math on these things: famously, for most trained models, if you find the vector embeddings for king, man, woman and queen and calculate king - man + woman, the result is very close to queen.

		(Does that mean that there are only 2^16 possible different kinds of facts about things in this model? No, because high-dimensional geometry is very unintuitively powerful. The facts are not axis-aligned, and they don't need to be perfectly orthogonal. This matters, because the number of individual vectors you can fit into a single 2^16-dimensional space that are all orthogonal to each other (all angles 90 degrees) is of course 2^16. But if you allow for almost-orthogonal vectors, the number is larger than the number of atoms in the universe. If this sounds wacky, for people with a CS background it can help to think of it as working a bit like a Bloom filter, in that collisions are possible. In practice, though, collisions are mostly theoretical, because 2^16 is a very large number.)
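
To make the "almost orthogonal" point concrete, here is a small pure-Python sketch. It is scaled down to 4096 dimensions so it runs quickly, and it uses random Gaussian directions as a stand-in for learned feature directions, which is an assumption for illustration and not how any real model picks its directions. It draws a handful of random unit vectors and reports the worst pairwise cosine similarity.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction: Gaussian components, normalized to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine similarity| over all pairs of unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst

DIM = 4096    # scaled down from 2^16 to keep the pure-Python loops fast
COUNT = 16    # a handful of vectors, giving 120 pairs to compare

rng = random.Random(0)  # fixed seed so the run is repeatable
vectors = [random_unit_vector(DIM, rng) for _ in range(COUNT)]
worst = max_abs_cosine(vectors)
closest_angle = math.degrees(math.acos(worst))

print(f"worst |cos| over all pairs: {worst:.4f}")
print(f"smallest angle between any pair: {closest_angle:.1f} degrees")
```

The typical cosine between two random directions shrinks roughly like 1/sqrt(dimension), so every pair here lands within a few degrees of 90. This only demonstrates that random directions in high dimensions are nearly orthogonal; it does not by itself show how many such directions can be packed in before they start to interfere.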