HocusLocus 4 days ago

By 1990, OmniPage 3 and its successors were 'good enough'; with their compact dictionaries and letter-form recognition, they were miracles of their time at ~300MB installed.

In 2025 LLMs can 'fake it' using Trilobites of memory and Petaflops. It's funny actually, like a supercomputer being emulated in real time on a really fast Jacquard loom. By 2027 even simple hand held calculator addition will be billed in kilowatt-hours.

privatelypublic 3 days ago | parent | next [-]

If you think 1990s OCR - even 2000s OCR - is remotely as good as modern OCR... I`v3 g0ta bnedge to sell.

skygazer 3 days ago | parent | next [-]

I had an on-screen OCR app on my Amiga in the early 90s that was amazing, so long as the captured text image used a system font. By avoiding all the mess of reality like optics, perspective, sensors, and physics, it could be basically perfect.

privatelypublic 3 days ago | parent | next [-]

If you want to go back to the start, look up MICR. Used to sort checks.

OCR'ing a fixed, monospaced font from a pristine piece of paper really is "solved." It's all the nasties of the real world that are the issue.

As I mockingly demonstrated: kerning, character similarity, grammar, lexing - all present large and hugely time-consuming problems to solve in the processes where OCR is most useful.

Someone 3 days ago | parent | prev [-]

MacPaint had that in 1983, but the feature never shipped because Bill Atkinson “was afraid that if he left it in, people would actually use it a lot, and MacPaint would be regarded as an inadequate word processor instead of a great drawing program” (https://www.folklore.org/MacPaint_Evolution.html).

The same page also shows a way to do it fast:

“ First, he wrote assembly language routines to isolate the bounding box of each character in the selected range. Then he computed a checksum of the pixels within each bounding box, and compared them to a pre-computed table that was made for each known font, only having to perform the full, detailed comparison if the checksum matched.”
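
For illustration, here's a minimal Python sketch of that scheme, assuming a toy glyph format (tuples of row bitmasks): checksum the pixels in each character's bounding box, look the checksum up in a table precomputed per font, and only do the full bitmap comparison on a match. The real routines were hand-written 68k assembly; everything named here is invented for the example.

    def checksum(bitmap):
        """Cheap hash over a glyph bitmap (a tuple of row bitmasks)."""
        h = 0
        for row in bitmap:
            h = (h * 31 + row) & 0xFFFF
        return h

    def build_font_table(font):
        """Precompute {checksum: [(char, bitmap), ...]} for a known font."""
        table = {}
        for char, bitmap in font.items():
            table.setdefault(checksum(bitmap), []).append((char, bitmap))
        return table

    def recognize(glyph, table):
        """Full pixel-by-pixel comparison only when the checksum matches."""
        for char, bitmap in table.get(checksum(glyph), []):
            if bitmap == glyph:  # the "full, detailed comparison"
                return char
        return None  # unknown glyph, or a different font

    # Toy 3x3 "font": each glyph is a tuple of row bitmasks.
    font = {"I": (0b010, 0b010, 0b010), "L": (0b100, 0b100, 0b111)}
    table = build_font_table(font)
    print(recognize((0b100, 0b100, 0b111), table))  # -> L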

bayindirh 3 days ago | parent | prev | next [-]

Tesseract can do wonders for scanned paper (and web-generated PDFs) in both its old and new versions. If you want to pay for something closed, Prizmo on macOS is extremely good as well.
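
For reference, driving it from Python takes only a few lines via the pytesseract bindings; a minimal sketch, assuming the tesseract binary and language data are installed (the file name is a placeholder):

    from PIL import Image
    import pytesseract

    # A reasonably clean scan; `lang` selects the traineddata to use ("eng", "deu", ...).
    page = Image.open("scanned_page.png")
    text = pytesseract.image_to_string(page, lang="eng")
    print(text)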

On the other hând, LLm5 are sl0wwer, moré resource hangry and l3ss accurale fr their outpu1z.

We shoulD stop gl0rıfying LLMs for 3verylhin9.

agentcoops 3 days ago | parent [-]

I've worked extensively with Tesseract, ABBYY, etc. in personal and professional contexts. Of course they work well for English-language documents without any complexity of layout that are scanned without the slightest defect. At this point, based on extensive testing for work, state-of-the-art LLMs simply have better accuracy -- by an order of magnitude if you have non-English documents with complex layouts and less-than-ideal scans. I'll give you speed, but the accuracy is so much greater (and the need for human intervention so much less) that in my experience it's a worthwhile trade-off.

I'm not saying this applies to you, but my sense from this thread is that many are comparing the results of tossing an image into a free ChatGPT session with an "OCR this document" prompt to a competent Tesseract-based tool... LLMs certainly don't solve any and every problem, but this should be based on real experiments. In fact, OCR is probably the main area where I've found them to simply be the best solution for a professional system.
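
For what it's worth, a "real experiment" usually looks less like a chat session and more like this rough sketch: a fixed transcription prompt, a vision-capable model called over the API, and the raw output saved so it can be scored against a Tesseract baseline. The model name and file here are placeholders, not a recommendation.

    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("scan.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any vision-capable model under test
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text in this scan exactly as written. "
                         "Preserve line breaks. Do not translate or summarize."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)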

privatelypublic 3 days ago | parent [-]

Yeah. As usual, I didn't articulate my point well. A tuned system with an optimized workflow will have by far the best results. And maybe LLMs will be a key resource in bringing OCR into usable/profitable areas.

But there's also a ton of "I don't want to deal with this" type work items that can't justify a full workflow build-out, but that LLMs get near enough to perfect to be "good enough." The bad part is that the LLMs don't explain to people the kinds of mistakes to expect from them.


Y_Y 3 days ago | parent | prev | next [-]

https://en.wikipedia.org/wiki/Trilobite

Trilobites? Those were truly primitive computers.

__alexs 3 days ago | parent [-]

Didn't the Discworld books have these?

jchw 3 days ago | parent | prev [-]

A while ago I tried throwing a couple of random, simple Japanese comics from Pixiv (think 4koma, though I don't think either of the ones I tried was actually four panels) into Gemma 3b on AI Studio.

- It transcribed all of the text, including speech, labels on objects, onomatopoeias in actions, etc. I did notice a kana was missing a diacritic in a transcription, so the transcriptions were not perfect, but pretty close actually. To my eye all of the kanji looked right. Latin characters already OCR pretty well, but at least in my experience other languages can be a struggle.

- It also, unprompted, correctly translated the fairly simple Japanese to English. I'm not an expert, but the translations looked good to me. Gemini 2.5 did the same, and while it had a slightly different translation, both of them were functionally identical, and similar to Google Translate.

- It also explained the jokes, the onomatopoeias, etc. To the extent I could verify these things, they seemed correct, though notably the onomatopoeia used for actions in Japanese comics is pretty diverse and not necessarily well documented. But contextually it seemed right. (A rough sketch of the kind of call involved follows.)
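
For concreteness, here is that sketch, using the google-generativeai Python bindings; the model name, file, and prompt are placeholders for whatever multimodal model AI Studio exposes.

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="...")  # AI Studio API key

    # Placeholder model name; substitute the multimodal model being tested.
    model = genai.GenerativeModel("gemini-2.5-flash")

    page = Image.open("comic_page.png")  # hypothetical comic page
    resp = model.generate_content([
        page,
        "Transcribe all Japanese text in this comic (speech, labels, "
        "onomatopoeia), then translate it to English and explain any wordplay.",
    ])
    print(resp.text)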

To me this is interesting. I don't want to anthropomorphize the models (at least not unduly, though I am describing them as if they chose to do these things, since it's natural to), but the fact that even relatively small local models such as Gemma can perform tasks like this on arbitrary images with handwritten Japanese text bodes well. Traditional OCR struggles to find and recognize text that isn't English or is stylized or handwritten, and it can't use context clues or its own "understanding" to fill in blanks where things are otherwise unreadable; at best it can take advantage of more basic statistics, which can take you quite far but won't get you to the same level of proficiency at the job as a human. vLLMs, however, definitely have an advantage in the amount of knowledge embedded within them, and can use that knowledge to cut through ambiguity. I believe this gets them closer.

I've messed around with using vLLMs for OCR tasks a few times primarily because I'm honestly just not very impressed with more traditional options like Tesseract, which sometimes need a lot of help even just to find the text you want to transcribe, depending on how ideal the case is.
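
To make "a lot of help" concrete, here's the kind of cleanup pass that's often needed before Tesseract sees a photo, plus dumping the per-word confidences to see what it actually found. A sketch only; the thresholds, scaling, and file name are arbitrary.

    import cv2
    import pytesseract

    # Hypothetical input: a phone photo rather than a clean scan.
    img = cv2.imread("photo_of_page.jpg")

    # Typical preprocessing: grayscale, upscale, and Otsu binarization
    # to separate ink from background before recognition.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Per-word text and confidence scores show what Tesseract detected (and missed).
    data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
    for word, conf in zip(data["text"], data["conf"]):
        if word.strip() and float(conf) > 0:
            print(conf, word)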

On the scale of AI hype bullshit, the use case of image recognition and transcription rates damn near zero: it really is actually useful here. Some studies have shown that vLLMs are "blind" in some ways (they can be made to fail by tricking them, like Photoshopping a cat to have an extra leg and asking how many legs the animal in the photo has; in that case the priors from the model's training data work against it), and there are some other limitations (I think generally when you use AI for transcription it's hard to get spatial information about what is being recognized, though I think some techniques have been applied, like recursively cutting an image up and feeding the pieces back in to refine bounding boxes), but the degree to which it works is, in my honest opinion, very impressive and very useful already.

I don't think that this demonstrates that basic PDF transcription, especially of cleanly-scanned documents, really needs large ML models... But on the other hand, large ML models can handle both easy and hard tasks here pretty well if you are working within their limitations.

Personally, I look forward to seeing more work done on this sort of thing. If it becomes reliable enough, it will be absurdly useful for both accessibility and breaking down language barriers; machine translation has traditionally been a bit limited in how well it can work on images, but I've found Gemini, and surprisingly often even Gemma, can make easy work of these tasks.

I agree these models are inefficient. I mean, traditional OCR aside, our brains do similar tasks but burn less electricity and ostensibly need less training data (at least certainly less text) to do it. It certainly must be physically possible to make more efficient machines that can do these tasks with similar fidelity to what we have now.

agentcoops 3 days ago | parent [-]

100%. My sense is that many in this thread have never gone through the misery of trying to use classical OCR for non-English documents or where you can't control scan quality. I recently did a test with 18th-century German documents written in a well-known and standardized but archaic script. The accuracy of classical models specifically trained on that corpus was an order of magnitude lower than GPT-5's. I haven't experimented personally or professionally with smaller models, but your experience makes me hopeful that we might get this level of OCR accuracy on phones sooner rather than later...

bugglebeetle 3 days ago | parent [-]

William Mattingly has been doing a lot of work on similar documents in an archival context with VLLMs. You should check out their work:

https://x.com/wjb_mattingly

https://github.com/wjbmattingly