legel 6 hours ago

You are wrong.

Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
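To put rough numbers on that, here is a back-of-the-envelope sketch. The 1024-dimension figure comes from the comment above; the 16-bit float width and the 4-characters-per-token average are assumptions for illustration, not properties of any particular model.

```python
def embedding_bits(dim: int = 1024, bits_per_float: int = 16) -> int:
    """Bits occupied by one token's embedding vector (float width is assumed)."""
    return dim * bits_per_float


def raw_text_bits(num_chars: int, bits_per_char: int = 8) -> int:
    """Bits occupied by the same text stored as plain 8-bit characters."""
    return num_chars * bits_per_char


# Hypothetical average of 4 characters per token.
chars_per_token = 4
per_token_embedding = embedding_bits()
per_token_raw = raw_text_bits(chars_per_token)
print(per_token_embedding, per_token_raw, per_token_embedding // per_token_raw)
```

Under these assumptions the embedding is several hundred times larger than the raw characters it stands for, which is the gap the comment is pointing at.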

DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]

Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the GitHub repo, but it is a promising research direction.

[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...

nextaccountic an hour ago | parent | next [-]

Well, then we could presumably also add lossy compression to texts, without passing through images first

Groxx 5 hours ago | parent | prev | next [-]

I kinda wonder if it's extracting usable context from 2D proximity between lines? Normal text input wouldn't have that kind of information (though it could, and it's arguably just a lookahead/behind of N characters on average).

deburo 6 hours ago | parent | prev | next [-]

A token is probably not a single char, and an image is probably decomposed into tokens as well (and god knows how many tokens an image is decomposed into) which probably map to similar float-hungry vectors. Your counterargument could use a bit more flesh.

And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.

gamblor956 3 hours ago | parent | prev | next [-]

People really need to read their cites and not just the summaries.

The paper notes two things:

1) While the compression ratio for visual text is better than it is for regular text, the absolute space required is still higher for the images. The OPs were talking about the space required, not the ratio.

2) The results of the OCR must still be fed into a text-based LLM for linguistic processing. Otherwise, all you have achieved is turning an image into a bunch of text.

TZubiri 5 hours ago | parent | prev [-]

>Text tokens are high-dimensional vectors,

You are conflating tokens with embeddings.

A token fits in a single machine word: modern GPT uses a vocabulary with 200k possible values, so a token ID would fit into 18 bits.
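The 18-bit figure is easy to check. A minimal sketch, where `bits_for_token_id` is a hypothetical helper name and 200,000 is the vocabulary size quoted above:

```python
def bits_for_token_id(vocab_size: int) -> int:
    """Smallest number of bits that can index every ID in 0..vocab_size-1."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be positive")
    # The largest ID is vocab_size - 1; bit_length gives the bits needed to store it.
    return (vocab_size - 1).bit_length()


print(bits_for_token_id(200_000))
```

17 bits only cover 131,072 values, so 18 is the minimum for a 200k vocabulary. That is the size of the token ID, not of the embedding vector the model looks up from it.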

Have a good one