TZubiri 6 hours ago

Of course it isn't

A text encoding uses about 8 bits per character on average, and tokenization compresses that further.

A glyph in an image font would take 25 bits at 5x5 pixels (1 bit per pixel), and most fonts are 12 pixels high.
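To make the arithmetic concrete, here is a minimal sketch. It assumes 1 bit per pixel, and the 6-pixel width for the 12-pixel-high glyph is a made-up value for illustration, not something from the repo.

```python
# Raw bits needed to store one character as a bitmap glyph vs. as encoded text.
# Assumptions: 1 bit per pixel (monochrome); the 6 px width is hypothetical.

TEXT_BITS_PER_CHAR = 8  # one byte per character, as in the comment above


def glyph_bits(width_px: int, height_px: int, bits_per_pixel: int = 1) -> int:
    """Bits needed for an uncompressed bitmap glyph of the given size."""
    return width_px * height_px * bits_per_pixel


tiny = glyph_bits(5, 5)       # 25 bits
typical = glyph_bits(6, 12)   # 72 bits (assumed 6 px wide)

print(tiny / TEXT_BITS_PER_CHAR)     # 3.125x the size of a text byte
print(typical / TEXT_BITS_PER_CHAR)  # 9.0x the size of a text byte
```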

Of course it isn't efficient. This is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit).

legel 6 hours ago | parent | next [-]

You are wrong.

Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.

DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]

Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the github, but is a promising research direction.

[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...

nextaccountic an hour ago | parent | next [-]

Well, then we could presumably also apply lossy compression to text directly, without passing through images first.

Groxx 5 hours ago | parent | prev | next [-]

I kinda wonder if it's extracting usable context from 2D proximity between lines? Normal text input wouldn't have that kind of information (though it could, and it's arguably just a lookahead/behind of N characters on average).

deburo 6 hours ago | parent | prev | next [-]

A token is probably not a single char, and an image is probably decomposed into tokens as well (and god knows how many tokens an image is decomposed into) which probably map to similar float-hungry vectors. Your counterargument could use a bit more flesh.

And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.

gamblor956 3 hours ago | parent | prev | next [-]

People really need to read their cites and not just the summaries.

The paper notes two things:

1) The compression ratio for visual text is better than it is for regular text, but the absolute space required is still higher for the images. The OPs were talking about the space required, not the ratio.

2) The results of the OCR must still be fed into a text-based LLM for linguistic processing. Otherwise, all you have achieved is turning an image into a bunch of text.

TZubiri 5 hours ago | parent | prev [-]

>Text tokens are high-dimensional vectors,

You are conflating tokens with embeddings.

A token fits in a single machine word. Modern GPT models use a vocabulary of about 200k possible values, so a token ID fits in 18 bits.
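A quick sketch of that distinction. It assumes a vocabulary of exactly 200,000 entries and takes the 1024-float embedding mentioned upthread as float32; both numbers are illustrative assumptions, not properties of any specific model.

```python
# Token ID size vs. embedding size.
# Assumptions: vocabulary of exactly 200_000 entries; 1024-dim float32 embedding.


def bits_for_vocab(vocab_size: int) -> int:
    """Minimum bits needed to index every entry in a vocabulary."""
    return (vocab_size - 1).bit_length()


def embedding_bits(dim: int, bits_per_float: int = 32) -> int:
    """Bits occupied by one embedding vector of the given dimension."""
    return dim * bits_per_float


token_id_bits = bits_for_vocab(200_000)  # 18
vector_bits = embedding_bits(1024)       # 32768

print(token_id_bits, vector_bits)
```

The token ID is what gets stored or transmitted; the embedding is what the model looks up from that ID once the input is inside the network.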

Have a good one

netsharc 6 hours ago | parent | prev | next [-]

Huh, what if the image encoding used 8 bits for each of a pixel's R, G, and B values? Then one could encode the same amount of text in fewer pixels (3 letters would need 1 pixel instead of three 12x12-pixel glyphs).

The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be a random-looking colourful palette. It might not even need to use 8 bits per character, since ASCII is 7 bits/character.
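A minimal sketch of the packing idea, assuming plain ASCII input and one byte per colour channel. The function names are hypothetical, the pixels are plain tuples rather than a real image file, and zero-byte padding is an arbitrary choice for the sketch.

```python
# Pack ASCII text into RGB pixels, three characters per pixel, and back.
# Assumptions: ASCII-only input; NUL bytes used as padding (so the text
# itself must not end in NUL characters).


def pack_text_to_pixels(text: str) -> list[tuple[int, ...]]:
    """Encode text as a list of (R, G, B) tuples, 3 bytes per pixel."""
    data = text.encode("ascii")
    padded = data + b"\x00" * (-len(data) % 3)  # pad to a multiple of 3
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


def unpack_pixels_to_text(pixels: list[tuple[int, ...]]) -> str:
    """Decode the pixel list back into text, dropping the padding."""
    data = bytes(value for pixel in pixels for value in pixel)
    return data.rstrip(b"\x00").decode("ascii")


pixels = pack_text_to_pixels("hello")
print(pixels)                         # [(104, 101, 108), (108, 111, 0)]
print(unpack_pixels_to_text(pixels))  # hello
```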

TZubiri 5 hours ago | parent [-]

Then it's no longer an image like the one in the GitHub repo. You would be encoding the text as characters and sending it as an image.

You can achieve this by changing the extension of an image file from .bmp to .txt

Guys, not to be mean, but maybe chill with the state of the art research and go back to studying fundamentals.

vineyardmike 6 hours ago | parent | prev [-]

[dead]