yorwba 3 hours ago
LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable, and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead! Whether such lossy compression is acceptable for your use case is up to you.
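As a rough sanity check of that arithmetic, here is a minimal Python sketch. The constants are only the ballpark figures from the comment (with "megabytes" taken as exactly 1 MB, which is an assumption), not measurements, and the names are made up for illustration.

```python
# Ballpark figures from the comment above; all are assumptions, not measurements.
KV_BYTES_PER_TEXT_BYTE = 1_000_000   # "on the order of megabytes", taken here as 1 MB
KV_BYTES_PER_IMAGE_BYTE = 1_000      # "maybe a kilobyte" per byte of image
IMAGE_BYTES_PER_TEXT_BYTE = 100      # rendering cost: 100 image bytes per text byte

# KV cache spent per byte of text when the text goes through the image path.
kv_via_image = IMAGE_BYTES_PER_TEXT_BYTE * KV_BYTES_PER_IMAGE_BYTE

# How many times smaller that is than the direct text path.
savings_factor = KV_BYTES_PER_TEXT_BYTE / kv_via_image

print(f"KV bytes per text byte via image: {kv_via_image}")
print(f"savings factor vs. plain text: {savings_factor:.0f}x")
```

With these inputs the image path comes out an order of magnitude smaller, but the result is only as good as the guessed constants.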
Taek 3 hours ago
I don't think it's that bad. If I recall correctly, it's about 8 kilobytes per token, and a token can be 3-4 characters, so you're talking ~2 kilobytes per character. An image token, as I recall, covers something like 16x16 pixels, so you get 32 bytes of overhead per pixel. And a character is minimally like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12. So 3x savings... which actually maps pretty closely to 60% savings.
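The back-of-the-envelope numbers above can be reproduced with a short Python sketch. Every constant is a recalled estimate from the comment (8 KB per token is taken as 8 * 1024 bytes, and 4 characters per text token as the upper end of "3-4"), and the variable names are hypothetical.

```python
# Recalled estimates from the comment; assumptions, not measured values.
KV_BYTES_PER_TOKEN = 8 * 1024   # ~8 KB of KV cache per token (text or image)
CHARS_PER_TEXT_TOKEN = 4        # upper end of "3-4 characters" per text token
PATCH_SIDE = 16                 # an image token covers a 16x16 pixel patch
PIXELS_PER_CHAR = 20            # minimal glyph footprint, including whitespace

pixels_per_image_token = PATCH_SIDE * PATCH_SIDE

# KV overhead per pixel and per character on each path.
kv_bytes_per_pixel = KV_BYTES_PER_TOKEN / pixels_per_image_token
kv_bytes_per_char_text = KV_BYTES_PER_TOKEN / CHARS_PER_TEXT_TOKEN
chars_per_image_token = pixels_per_image_token / PIXELS_PER_CHAR
kv_bytes_per_char_image = KV_BYTES_PER_TOKEN / chars_per_image_token

# Savings expressed both as a factor and as a fraction of the text-path cost.
savings_factor = chars_per_image_token / CHARS_PER_TEXT_TOKEN
fraction_saved = 1 - 1 / savings_factor

print(f"KV bytes per pixel: {kv_bytes_per_pixel:.0f}")
print(f"chars per image token: {chars_per_image_token:.1f}")
print(f"savings: {savings_factor:.1f}x ({fraction_saved:.0%} less KV per char)")
```

Under these assumptions the factor lands a little above 3x, so the percentage is in the same ballpark as the 60% figure rather than an exact match.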
esafak 2 hours ago
Can we get some refs for this number? If true, it sounds like poor design.