hn_throwaway_99 5 hours ago
This is really fascinating to me. I was reading this article and originally agreed with you: "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute." But then there is a comment below about how DeepSeek was able to get a huge improvement in compression by using visual tokens (https://news.ycombinator.com/item?id=48777848). I don't fully understand the underlying technical details, so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.
stingraycharles 42 minutes ago
>I mean, under the covers it's got to be converting to text tokens at some point
Multi-modal models do actually natively tokenize images, though, so an image doesn't have to be converted to text for this to work. They may do it anyway for accuracy, but it's not at all required. Effectively, an image is scaled to a standard size and cut up into patches, and each patch is assigned a separate token, much in the same way text is tokenized. Train the model on this as well and you'll end up with a model that can understand images.
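The "scale, cut up, one token per cut" step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not any particular model's tokenizer: `split_into_patches` and the 4x4 grid are hypothetical names and data made up for the example, and in ViT-style models each patch is additionally passed through a learned projection to become an embedding vector, which is omitted here.

```python
def split_into_patches(image, patch):
    """Cut a 2D grid of pixel values into non-overlapping patch x patch tiles.

    `image` is a list of equal-length rows whose height and width are
    multiples of `patch` (i.e. it was already scaled to a standard size).
    Returns the tiles in row-major order, each flattened to a list of pixels.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image must be scaled to a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Collect the pixels of one tile, row by row.
            tile = [image[y][x]
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)]
            tiles.append(tile)
    return tiles


# A toy 4x4 "image" cut into 2x2 patches yields 4 image tokens.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = split_into_patches(image, patch=2)
print(len(tiles))  # number of image tokens
print(tiles[0])    # pixels of the top-left patch
```

The point of the sketch is that the token count depends only on the image size and the patch size, not on how much text happens to be drawn inside the image.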
supern0va an hour ago
>This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."
It'd be weird if they were doing this, since it would mean the context window size was a lie and that the API would presumably reject requests whose expanded form went over the 1M-token limit. For someone using pxpipe with an effective context compression of 90% in some instances, it'd hit the limit at barely 100k.
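The arithmetic behind that "barely 100k" figure, as a small sketch. It assumes, as the comment does for the sake of argument, that images were silently expanded back to text before the limit check; `visual_tokens_at_limit` is a hypothetical helper, not part of any API.

```python
def visual_tokens_at_limit(expanded_limit, compression_percent):
    """Visual tokens whose expanded text form would exactly fill the limit.

    compression_percent is the share of tokens saved by the image form,
    so 90 means the image form uses 10% of the tokens of the text form.
    """
    return expanded_limit * (100 - compression_percent) // 100


# With a 1M-token limit applied to the expanded form and 90% compression,
# requests would start failing at about 100k visual tokens.
limit_hit_at = visual_tokens_at_limit(1_000_000, 90)
print(limit_hit_at)
```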
qingcharles an hour ago
I am trying to get rough summaries of long PDFs of scanned pages of text. At first I was doing OCR and passing the (tens of thousands of) characters into the LLM, which works, but it's expensive. I asked Gemini how to save costs and it said just send in all the images of the pages instead. Instinctively, as a developer, it's hard to fathom how sending 200 images is cheaper than sending the text, but it definitely works.
yorwba 3 hours ago
LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead! Whether such lossy compression is acceptable for your use case is up to you.
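Plugging the rough figures from the comment above into a quick calculation makes the trade explicit. The constants below are those order-of-magnitude estimates, with "megabytes" taken as 1 MB for illustration. They are not measurements of any specific model.

```python
# Order-of-magnitude figures from the comment above (assumptions, not measurements).
KV_BYTES_PER_TEXT_BYTE = 1_000_000   # "megabytes of KV cache per byte of text", taken as 1 MB
KV_BYTES_PER_IMAGE_BYTE = 1_000      # "maybe a kilobyte of KV cache per byte of image"
IMAGE_BYTES_PER_TEXT_BYTE = 100      # text rendered at ~100 image bytes per text byte

# KV cache needed per byte of original text when it is sent as an image.
kv_via_image = IMAGE_BYTES_PER_TEXT_BYTE * KV_BYTES_PER_IMAGE_BYTE

# How many times smaller the image route is than the plain-text route.
savings_factor = KV_BYTES_PER_TEXT_BYTE / kv_via_image

print(kv_via_image)     # bytes of KV cache per text byte via the image route
print(savings_factor)   # text route divided by image route
```

Under these numbers the image route needs 100 kB of KV cache per byte of text, a tenfold reduction, even though the rendered image is a hundred times larger than the text in raw bytes.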
satvikpendem 3 hours ago
Baidu released a faster OCR model as well: https://github.com/baidu/Unlimited-OCR
DANmode 5 hours ago
It wouldn't; they're subsidizing it for training.
Edit: didn't realize this occurred on local models(!!); this is smarter: https://news.ycombinator.com/item?id=48779884