aabhay 6 hours ago

In Gemini at least, if you look at how they process PDFs, they do an OCR and then feed the text + image to the model, without charging you for the text tokens (I believe).

So my guess is that Claude’s backend is doing the same — so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does

hn_throwaway_99 5 hours ago | parent | next [-]

This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."

But then there is a comment below talking about how DeepSeek was able to get a huge improvement in compression by using visual tokens, https://news.ycombinator.com/item?id=48777848. I don't fully understand all of the underlying technical details so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.

stingraycharles 44 minutes ago | parent | next [-]

“I mean, under the covers it's got to be converting to text tokens at some point”

Multi-modal models do actually natively tokenize images, though. So it doesn’t have to be converted to text for it to work. They may do it anyway for accuracy, but it’s not at all required.

Effectively an image is scaled to a standard size, rasterized / cut up, and each cut is assigned a separate token, much in the same way text is tokenized. Train the model on this as well and you’ll end up having a model that can understand images.
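
To make the "cut up" step concrete, here is a minimal sketch of slicing an image into fixed-size patches. The 8x8 image and the patch size of 4 are made-up illustration values, and `split_into_patches` is a hypothetical helper; a real model would also pass each patch through a learned encoder before it becomes a token.

```python
def split_into_patches(image, patch):
    """Cut a 2D image (list of pixel rows) into patch x patch tiles, row-major.

    Assumes the image dimensions are exact multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Take `patch` rows, and from each row the `patch` columns of this tile.
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(tile)
    return patches


# Toy 8x8 "image" whose pixel value encodes its position (illustration only).
image = [[y * 8 + x for x in range(8)] for y in range(8)]
patches = split_into_patches(image, 4)
print(len(patches))  # prints 4: a 2x2 grid of 4x4 tiles, i.e. four image "tokens"
```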

supern0va an hour ago | parent | prev | next [-]

>This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."

It'd be weird if they were doing this, since it would mean the context window size was a lie and that the API would presumably reject requests whose expanded form went over the 1m limit. For someone using pxpipe with an effective context compression of 90% in some instances, it'd hit the limit at barely 100k.
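
The arithmetic behind that last sentence, as a small sketch. `image_budget_if_expanded` is a hypothetical name, and the 1m limit and 90% figure are just the numbers from this comment.

```python
def image_budget_if_expanded(context_limit_tokens, compression_pct):
    """Image-encoded tokens that would fit if the backend silently expanded
    them back into text tokens before applying the context limit."""
    return context_limit_tokens * (100 - compression_pct) // 100


print(image_budget_if_expanded(1_000_000, 90))  # prints 100000
```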

qingcharles an hour ago | parent | prev | next [-]

I am trying to get rough summaries of long PDFs of scanned pages of text. At first I was doing OCR and passing the (tens of thousands of) characters into the LLM, which works, but it's expensive.

I asked Gemini how to save costs and it said just send in all the images of the pages instead. Instinctively, as a developer, it's hard to fathom how sending 200 images is cheaper than sending the text, but it definitely works.

yorwba 3 hours ago | parent | prev | next [-]

LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead!

Whether such lossy compression is acceptable for your use case is up to you.

Taek 3 hours ago | parent | next [-]

I don't think it's that bad, if I recall correctly it's about 8 kilobytes per token, and a token can be 3-4 characters so you're talking ~2 kilobytes per character.

An image token I recall is something like 16x16, so you get 32 bytes of overhead per pixel. And a character is minimally like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12.

So 3x savings... which actually maps pretty closely to 60% savings.
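
Spelling out that back-of-envelope estimate as a sketch. Every constant below is one of the recalled or guessed figures from this comment, not a measured value for any particular model.

```python
# All constants are the rough figures quoted above, not measurements.
KV_BYTES_PER_TOKEN = 8 * 1024   # "about 8 kilobytes per token"
CHARS_PER_TEXT_TOKEN = 4        # upper end of "3-4 characters"
PATCH_SIDE = 16                 # image token covers roughly 16x16 pixels
PIXELS_PER_CHAR = 20            # rendered character incl. whitespace

pixels_per_image_token = PATCH_SIDE ** 2                           # 256
kv_bytes_per_pixel = KV_BYTES_PER_TOKEN / pixels_per_image_token   # 32.0
chars_per_image_token = pixels_per_image_token / PIXELS_PER_CHAR   # about 12.8
ratio = chars_per_image_token / CHARS_PER_TEXT_TOKEN               # about 3.2
savings = 1 - 1 / ratio                                            # fraction of tokens saved

print(f"{ratio:.1f}x fewer tokens, {savings:.0%} savings")
```

With these inputs the ratio comes out a little above 3x, which corresponds to roughly two-thirds fewer tokens, so the 60% figure is in the right neighbourhood.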

esafak 3 hours ago | parent | prev [-]

Can we get some refs for this number? If true it sounds like poor design.

Tuna-Fish 2 hours ago | parent [-]

It's not quite as bad as the parent made it out to be; the largest I've seen is 32kB per token (where a token sometimes represents a byte, but usually represents more than one).

It's forced by the nature of how LLMs use vector embeddings for language.

Basically, a single token in an LLM is represented as an n-element vector, where n is the "hidden dimension", also known as the model dimension. In order for the model to be smart, the hidden dimension needs to be large, on the order of 2^16 on top-tier models. Elements of this vector are typically quantized to 2-byte floats, or sometimes smaller. Every possible fact is embedded as a direction in this very high-dimensional vector space, and a token is related to a fact if the vector representing that token points in a similar direction to that fact. You can do vector math on these things: famously, for most trained models, if you find the vector embeddings for king, man, woman and queen and calculate king - man + woman, the result is very close to queen.
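
A toy illustration of that vector arithmetic. The four-dimensional embeddings below are hand-made for this example (the axes "royalty, male, female, fruit" are an assumption, and nothing here comes from a trained model); real embeddings have thousands of dimensions and the match is only approximate.

```python
import math

# Hand-made toy embeddings, NOT taken from any trained model.
# Hypothetical axes: royalty, male, female, fruit.
EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word whose vector is closest in direction to a - b + c."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


nearest = analogy("king", "man", "woman")
print(nearest)  # prints queen
```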

(Does that mean that there are 2^16 possible different kinds of facts about things in this model? No, because high-dimensional geometry is very unintuitively powerful. The facts are not axis-aligned, and they don't need to be perfectly orthogonal. This matters, because the number of individual vectors you can fit into a single 2^16-dimensional space that are orthogonal to each other (all angles 90 degrees) is of course 2^16. But if you allow for almost-orthogonal vectors, the number is larger than the number of atoms in the universe. If this sounds wacky, for people with a CS background it can help to think of it as working a bit like a Bloom filter, in that collisions are possible. Although in practice they are only theoretical, because 2^16 is a very large number.)
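
A quick way to see the "almost orthogonal" effect, as a sketch: draw random directions and look at the worst pairwise cosine. The dimensions (8 and 4096), the 20 vectors and the seed are arbitrary choices for this demo; 4096 is used instead of 2^16 only to keep pure Python fast.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Random direction: Gaussian coordinates, normalised to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs; 0 would mean perfectly orthogonal."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            # Vectors are unit length, so the dot product is the cosine.
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst


rng = random.Random(0)  # fixed seed so the run is repeatable
low = max_abs_cosine([random_unit_vector(8, rng) for _ in range(20)])
high = max_abs_cosine([random_unit_vector(4096, rng) for _ in range(20)])
print(f"dim 8: worst |cos| = {low:.3f}; dim 4096: worst |cos| = {high:.3f}")
```

In 8 dimensions, 20 random directions crowd each other noticeably; in 4096 dimensions every pair is within a few degrees of a right angle without any effort to arrange them.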

satvikpendem 3 hours ago | parent | prev | next [-]

Baidu released a faster OCR model as well: https://github.com/baidu/Unlimited-OCR

DANmode 5 hours ago | parent | prev [-]

It wouldn’t, they’re subsidizing it for training.

Edit: didn’t realize this occurred on local models(!!),

this is smarter https://news.ycombinator.com/item?id=48779884

NooneAtAll3 4 hours ago | parent [-]

can't explain with subsidies a model you host yourself (like deepseek)

measurablefunc 3 hours ago | parent [-]

Then you are paying for the electricity. It's not physically possible to do more computation & not use more energy b/c every arithmetic operation requires a minimum amount of energy so more operations = more energy.

michaelt 2 hours ago | parent | prev | next [-]

Not necessarily. See the paper "DeepSeek-OCR: Contexts Optical Compression" [1].

One option, when an image is fed into an LLM, is to divide it into tiles, then those tiles pass through a 'vision encoder' neural network to make 'vision tokens' which are then input into the LLM much like text tokens are. Obviously you train the vision encoder and LLM to understand one another. This is known as an 'end-to-end OCR model'.

And it turns out, once you've trained a model to do this, you can vary the number of 'vision tokens' used to represent a given text document by scaling an image of a document up or down, and see what happens. You also get a load of other parameters like patch size and vision encoder complexity and so on.

Turns out it works really well; in some tests they used 90% fewer input tokens, but still got 97% output performance.
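
To illustrate how scaling the page image changes the token count, here is a rough sketch that assumes one vision token per 16x16 tile and a 1024x1024 page render. Both numbers and the `vision_tokens` helper are hypothetical; the encoder in the paper may compress further, so treat this only as the shape of the trade-off.

```python
import math


def vision_tokens(width, height, scale, patch=16):
    """Tiles needed to cover the image after scaling, assuming one token per tile."""
    cols = math.ceil(width * scale / patch)
    rows = math.ceil(height * scale / patch)
    return rows * cols


# Hypothetical 1024x1024 render of a page of text.
for scale in (1.0, 0.5, 0.25):
    print(scale, vision_tokens(1024, 1024, scale))
# prints:
# 1.0 4096
# 0.5 1024
# 0.25 256
```

Halving the resolution cuts the token count by 4x; how far you can go before the text becomes unreadable to the model is the empirical question the paper tests.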

[1] https://arxiv.org/abs/2510.18234

Gooblebrai 4 hours ago | parent | prev [-]

Claude Science has a tool to extract the PDF but not sure if it's OCR'ing it.