geor9e 6 hours ago

Step back and think about it another way - ask which scenario is more likely:

Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years. That trick being to rasterize 8bit characters into 8x8 pixels in a big image. 60% in a market worth trillions of dollars.

or

Anthropic's marketing team arbitrarily prices tokens to drive growth, according to vibes and feelings, and didn't think they needed to price images on par with text in their rush to burn cash & drive growth. Some folks take advantage of the trick during the first few days of the model's availability before Anthropic corrects their pricing, to align more proportionally with actual compute costs.

calebkaiser 6 hours ago | parent | next [-]

Nah, optical compression is a thing. You see it in a lot of different areas in ML. In this case, the "trick" has been known for a while, and belongs to a whole world of compression research. But I think where you're maybe getting mixed up is in where that 60% gain is coming from.

It's not a 60% reduction in cost for 100% of the same output. If you have a model and input text A, and you fix the seed etc. and run Text A through the model as text tokens and as compressed image tokens, you will not get identical outputs. You're specifically reducing the number of tensors needed to represent your input, which saves you on raw compute, but also by definition gives you less room to represent the information in your input. It's lossy, in other words.

Put another way, if you're using a model like Fable because you need the absolute frontier of capability and cheaper models cannot solve your tasks, then there is a very real chance that a compression strategy like this drops Fable's accuracy such that it's no longer suitable for your task. Which defeats the point of you paying for the most expensive model in the first place.

So, it's cool research. Might be useful for some people. Probably isn't something that has incredible utility in real use cases.

rightbyte 5 hours ago | parent [-]

> a compression strategy

To me compression implies smaller size? However, newline chars seem to be removed in the pic, so I guess it could be expressed in fewer bytes than the original text with further compression ...

yorwba 4 hours ago | parent [-]

The size is indeed smaller, because text tokens and image tokens are embedded as vectors of the same size, but text tokens typically only cover a few characters, while image tokens typically cover many pixels, so many that you can fit more characters in there. So the same text takes up fewer tokens as an image, and hence requires less time and memory to process.

You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.
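A rough back-of-the-envelope sketch of the token arithmetic described above. Every number here is an assumption for illustration, not a property of any real model: about 4 characters per text token, 8x8 pixel glyphs (as in the top comment), a 1024-pixel-wide image, and a hypothetical square image patch of 32x32 pixels per image token.

```python
import math

def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Estimate text tokens, assuming a fixed average of chars per token."""
    return math.ceil(n_chars / chars_per_token)

def image_tokens(n_chars: int, glyph_px: int = 8,
                 patch_px: int = 32, width_px: int = 1024) -> int:
    """Estimate image tokens for text rasterized as glyph_px x glyph_px glyphs.

    Assumes one image token per patch_px x patch_px patch (hypothetical).
    """
    chars_per_row = width_px // glyph_px          # glyphs that fit on one row
    rows = math.ceil(n_chars / chars_per_row)     # rows of glyphs needed
    height_px = rows * glyph_px                   # image height in pixels
    patches_w = math.ceil(width_px / patch_px)    # patches across
    patches_h = math.ceil(height_px / patch_px)   # patches down
    return patches_w * patches_h

n = 10_000  # characters of input text
print("text tokens:          ", text_tokens(n))
print("image tokens (32px):  ", image_tokens(n))
# A very small patch inverts the relationship: the image needs more tokens.
print("image tokens (4px):   ", image_tokens(n, patch_px=4))
```

Under these assumptions a 32x32 patch holds 16 glyphs while a text token holds about 4 characters, which is where the saving comes from. Shrinking the patch to 4x4 pixels makes the image far more expensive than the text, which is the inverted case mentioned above.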

jayd16 3 hours ago | parent [-]

So split the difference and start encoding input at the words or phrases level?

calebkaiser 3 hours ago | parent [-]

Lots of researchers have done just this! There's a really rich history of research + lots of contemporary work on different encoding/representation strategies. This might be interesting to you: https://sbert.net/

What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.

vineyardmike 6 hours ago | parent | prev | next [-]

> Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years of multi-trillion dollar growth

DeepSeek published a pretty well circulated paper on exactly this many months ago. It just hasn’t been attempted and shared publicly, as a retrofit, AFAIK.

Also, it’s no free lunch, the readme indicates that this “use images” hack is lossy and reduces success rates alongside the reduced cost. Most labs would focus on success increases regardless of price.

geor9e 6 hours ago | parent [-]

If the trick were genuinely useful, and was well circulated months ago, the resource-starved inference providers would have squeezed this trick dry already, instead of wasting 60% of their tokens, waiting for users to implement it themselves in 5 minutes of effort.

Klathmon 3 hours ago | parent | next [-]

That's like saying quantization isn't real because the frontier labs aren't using it in their production inference.

This is a lossy process, it produces worse results. It might be worth it for some situations, but applying it to everything would just be making your SOTA model worse

ptx 2 hours ago | parent [-]

Isn't this just quantization with extra steps? Can converting the text to an image really be a better way to lossily compress it? (Not that I have any idea what I'm talking about on this topic.)

Klathmon an hour ago | parent [-]

I also have no idea what I'm talking about, but to me this seems closer to the "caveman mode" that some people use to compress info into fewer tokens. Going through the image tokenizer allows you to leave the source text untouched while still gaining (some of?) the benefits

solenoid0937 5 hours ago | parent | prev [-]

[flagged]

satvikpendem 3 hours ago | parent | prev | next [-]

An economist walks past a hundred dollar bill on the ground because someone would've picked it up already if it were real.

Aurornis 5 hours ago | parent | prev | next [-]

I think you missed the part where this is a lossy technique that reduces performance.

The image trick reduces context because it’s lossy. The README says you can’t use it for anything needing exact recall. It produces a gist of the input.

You could achieve something similar by using a small, cheap model to pre-summarize information for the expensive LLM. This is what many people do already and it’s a much better way to do it for most situations.

jug 5 hours ago | parent | prev | next [-]

Alternative 1 isn’t all that unlikely given Opus 4.8 couldn’t do this. So it’s a recently possible hack. Not something LLM corps have been blindsided by for years. I also strongly recommend RTFA in this case, namely “The honest part, read before relying on it”

stevenhuang 5 hours ago | parent | prev [-]

This has been known since VLMs were a thing, that more information can be encoded visually and token efficiency is increased. But it came with performance issues (more hallucinations, etc).

Also I don't think you realize how much dumb stuff is still left on the table. That the market is worth trillions is quite irrelevant here given the dynamism of the field.