Remix.run Logo
calebkaiser 5 hours ago

Nah, optical compression is a thing. You see it in a lot of different areas in ML. In this case, the "trick" has been known for a while, and belongs to a whole world of compression research. But I think where you're maybe getting mixed up is in where that 60% gain is coming from.

It's not a 60% percent reduction in cost for 100% of the same output. If you have a model and input text A, and you fix the seed etc. and run Text A through the model as text tokens and as compressed image tokens, you will not get identical outputs. You're specifically reducing the number of tensors needed to represent your input, which saves you on raw compute, but also by definition gives you less room to represent the information in your input. It's lossy, in other words.

Put another way, if you're using a model like Fable because you need the absolute frontier of capability and cheaper models cannot solve your tasks, then there is a very real chance that a compression strategy like this drops Fable's accuracy such that it's no longer suitable for your task. Which defeats the point of you paying for the most expensive model in the first place.

So, it's cool research. Might be useful for some people. Probably isn't something that has incredible utility in real use cases.

rightbyte 5 hours ago | parent [-]

> a compression strategy

To me compression implies smaller size? However new line chars seems to be removed in the pic so I guess it could be expressed in fewer bytes than the original text with further compression ...

yorwba 4 hours ago | parent [-]

The size is indeed smaller, because text tokens and image tokens are embedded as vectors of the same size, but text tokens typically only cover a few characters, while image tokens typically cover many pixels, so many that you can fit more characters in there. So the same text takes up fewer tokens as an image, and hence requires less time and memory to process.

You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.

jayd16 3 hours ago | parent [-]

So split the difference and start encoding input at the words or phrases level?

calebkaiser 3 hours ago | parent [-]

Lots of researchers have done just this! There's a really rich history of research + lots of contemporary work on different encoding/representation strategies. This might be interesting to you: https://sbert.net/

What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.