This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."

But then there is a comment below talking about how DeepSeek was able to get a huge improvement in compression by using visual tokens, https://news.ycombinator.com/item?id=48777848. I don't fully understand all of the underlying technical details so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.

▲

stingraycharles 42 minutes ago | parent | next [-]

“I mean, under the covers it's got to be converting to text tokens at some point”

Multi-modal models do actually natively tokenize images, though. So it doesn’t have to be converted to text for it to work. They may do it anyway for accuracy, but it’s not at all required.

Effectively, an image is scaled to a standard size and cut up into a grid of fixed-size patches, and each patch is assigned its own token, much in the same way text is tokenized. Train the model on this as well and you’ll end up with a model that can understand images.
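A minimal sketch of the cutting-up step, in plain Python. The 16-pixel patch size and the toy 64x48 grayscale image are assumptions for illustration, not any particular model's settings. In ViT-style encoders each flattened patch is then passed through a learned projection to become a token embedding, which is not shown here.

```python
# Sketch only: split an image into fixed-size patches, one "token" per patch.
# The patch size and the toy image below are illustrative assumptions.

def patchify(image, patch=16):
    """Return flattened patches of a 2D pixel grid, in row-major patch order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch into a single vector of pixel values.
            patches.append([
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return patches

# Hypothetical 64-wide by 48-tall grayscale image.
image = [[(x + y) % 256 for x in range(64)] for y in range(48)]
patches = patchify(image)
print(len(patches), "patches of", len(patches[0]), "pixels each")
```

The token count depends only on the image dimensions and the patch size, not on how much text happens to be drawn inside the image.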

▲

supern0va an hour ago | parent | prev | next [-]

>This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."

It'd be weird if they were doing this, since it would mean the advertised context window size was a lie, and the API would presumably reject requests whose expanded form went over the 1M-token limit. Someone using pxpipe, with an effective context compression of 90% in some instances, would hit the limit at barely 100k tokens.

▲

qingcharles an hour ago | parent | prev | next [-]

I am trying to get rough summaries of long PDFs of scanned pages of text. At first I was doing OCR and passing the (tens of thousands of) characters into the LLM, which works, but it's expensive.

I asked Gemini how to save costs and it said just send in all the images of the pages instead. Instinctively, as a developer, it's hard to fathom how sending 200 images is cheaper than sending the text, but it definitely works.

▲

yorwba 3 hours ago | parent | prev | next [-]

LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead!
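The arithmetic in that comment can be written out as a quick sketch. Every figure below is the comment's rough order of magnitude, not a measurement, and reading "megabytes" as exactly 1 MB is an assumption made here.

```python
# Sketch only: the comment's order-of-magnitude figures, not measured values.
KV_BYTES_PER_TEXT_BYTE = 1_000_000   # "megabytes of KV cache per byte of text" (assumed 1 MB)
KV_BYTES_PER_IMAGE_BYTE = 1_000      # "maybe a kilobyte of KV cache per byte of image"
IMAGE_BYTES_PER_TEXT_BYTE = 100      # "a hundred bytes of image per byte of text"

# KV cache needed per byte of text when the text is rendered as an image first.
kv_via_image = IMAGE_BYTES_PER_TEXT_BYTE * KV_BYTES_PER_IMAGE_BYTE

# How many times smaller the image route is under these assumptions.
ratio = KV_BYTES_PER_TEXT_BYTE / kv_via_image

print(f"via image: {kv_via_image} bytes of KV cache per text byte, {ratio:.0f}x smaller")
```

Under these assumed numbers the image route uses 100 kB instead of 1 MB per byte of text; the replies below dispute the starting figure.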

Whether such lossy compression is acceptable for your use case is up to you.

▲

Taek 3 hours ago | parent | next [-]

I don't think it's that bad. If I recall correctly it's about 8 kilobytes per token, and a token can be 3-4 characters, so you're talking ~2 kilobytes per character.

An image token, as I recall, covers something like 16x16 pixels, so at 8 kilobytes per token you get 32 bytes of overhead per pixel. And a character is minimally like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12.

So 3x savings, or about a 67% reduction... which actually maps pretty closely to 60% savings.
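The same back-of-the-envelope numbers as a small sketch. All constants are the recalled figures from the comment above, so treat them as assumptions; unrounded, 256 / 20 gives 12.8 characters per image token rather than 12, which nudges the factor to 3.2x.

```python
# Sketch only: constants are the comment's recalled figures, not measurements.
KV_BYTES_PER_TOKEN = 8 * 1024   # "about 8 kilobytes per token"
CHARS_PER_TEXT_TOKEN = 4        # upper end of "3-4 characters"
PATCH_SIDE = 16                 # "something like 16x16"
PIXELS_PER_CHAR = 20            # "minimally like 20 pixels including the whitespace"

pixels_per_image_token = PATCH_SIDE ** 2
bytes_per_pixel = KV_BYTES_PER_TOKEN / pixels_per_image_token
chars_per_image_token = pixels_per_image_token / PIXELS_PER_CHAR

# How many times more characters fit in one image token than in one text token.
savings_factor = chars_per_image_token / CHARS_PER_TEXT_TOKEN
# Fraction of tokens saved for the same amount of text.
reduction = 1 - 1 / savings_factor

print(f"{bytes_per_pixel:.0f} bytes per pixel")
print(f"{chars_per_image_token:.1f} characters per image token")
print(f"{savings_factor:.1f}x fewer tokens, a {reduction:.0%} reduction")
```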

▲

esafak 2 hours ago | parent | prev [-]

Can we get some refs for this number? If true it sounds like poor design.

▲

Tuna-Fish 2 hours ago | parent [-]

It's not quite as bad as the parent made it out to be; the largest I've seen is 32 kB per token (where sometimes a token represents a byte, but usually it represents more than one). It's forced by the nature of how LLMs use vector embeddings for language.

Basically, a single token in an LLM is represented as an n-element vector, where n is the "hidden dimension", also known as the model dimension. In order for the model to be smart, the hidden dimension needs to be large, on the order of 2^16 on top-tier models. Elements of this vector are typically quantized to 2-byte floats, or sometimes smaller.

Every possible fact is embedded as a direction in this very high-dimensional vector space, and a token is related to a fact if the vector representing that token points in a similar direction to that fact. You can do vector math on these things: famously, for most trained models, if you find the vector embeddings for king, man, woman and queen, and calculate king - man + woman, the result is very close to queen.

(Does that mean that there are only 2^16 possible different kinds of facts about things in this model? No, because high-dimensional geometry is very unintuitively powerful. The facts are not axis-aligned, and they don't need to be perfectly orthogonal. This matters, because the number of individual vectors you can fit into a single 2^16-dimensional space that are all orthogonal to each other (all angles 90 degrees) is of course 2^16. But if you allow for almost-orthogonal vectors, the number is larger than the number of atoms in the universe. If this sounds wacky, for people with a CS background it can help to think of it as working a bit like a Bloom filter, in that collisions are possible. In practice, though, collisions are mostly theoretical, because 2^16 is a very large number.)
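The "almost orthogonal" point can be checked with a small experiment: draw random directions and look at the largest absolute cosine similarity between any pair. This is only a sketch with assumed sizes (20 vectors, dimensions 8 and 2048, a fixed seed) and it uses random Gaussian vectors, not embeddings from any real model.

```python
# Sketch only: random vectors, not real model embeddings. Sizes are assumptions.
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_abs_cosine(dim, count, rng):
    """Largest |cosine| over all pairs among `count` random Gaussian vectors."""
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
    return max(
        abs(cosine(vectors[i], vectors[j]))
        for i in range(count)
        for j in range(i + 1, count)
    )

rng = random.Random(0)  # fixed seed so runs are repeatable
low = max_abs_cosine(dim=8, count=20, rng=rng)
high = max_abs_cosine(dim=2048, count=20, rng=rng)

print(f"dim 8:    worst pair |cos| = {low:.3f}")
print(f"dim 2048: worst pair |cos| = {high:.3f}")
```

In the low-dimensional case some pairs end up pointing in clearly similar directions, while in the high-dimensional case even the worst pair stays close to 90 degrees apart, which is the intuition behind fitting far more than n nearly independent directions into n dimensions.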

▲

satvikpendem 3 hours ago | parent | prev | next [-]

Baidu released a faster OCR model as well: https://github.com/baidu/Unlimited-OCR

▲

DANmode 5 hours ago | parent | prev [-]

It wouldn’t, they’re subsidizing it for training.

Edit: didn’t realize this occurred on local models(!!).

This is a smarter explanation: https://news.ycombinator.com/item?id=48779884

▲

NooneAtAll3 4 hours ago | parent [-]

You can't use subsidies to explain a model you host yourself (like DeepSeek).

▲

measurablefunc 3 hours ago | parent [-]

Then you are paying for the electricity. It's not physically possible to do more computation & not use more energy b/c every arithmetic operation requires a minimum amount of energy so more operations = more energy.
