aliljet 5 hours ago

This is actually the thing I desperately need. I'm routinely analyzing contracts that were faxed to me, scanned at monstrously poor resolution, wet-signed, all kinds of shit. The big LLM providers choke on this raw input, and I burn the entire context window on 30 pages of text. Understandable evals of the quality of these OCR systems (which are moving wicked fast) would be helpful...

And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.

coder543 5 hours ago | parent | next [-]

If you want OCR from the big LLM providers, you should probably pass one page per request. Having the model focus on OCR for only a single page at a time seemed to help a lot in my anecdotal testing a few months ago. You can even send all the pages in parallel as separate requests and get the better-quality result much faster, too.
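
Roughly, the per-page fan-out looks like this (a minimal sketch in Python, assuming pdf2image — which needs poppler installed — and the OpenAI Python SDK; the filename, model name, and prompt are placeholders, not recommendations):

```python
# Sketch: OCR a PDF one page per request, with the requests in parallel.
import base64
import io
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ocr_page(image) -> str:
    """Send a single page image and ask for a faithful transcription."""
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page exactly as written. "
                         "Do not correct errors or fill in gaps."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

pages = convert_from_path("contract.pdf", dpi=300)  # hypothetical file
with ThreadPoolExecutor(max_workers=8) as pool:
    text_by_page = list(pool.map(ocr_page, pages))  # order preserved

full_text = "\n\n".join(text_by_page)
```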

But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.

staticman2 3 hours ago | parent | next [-]

Gemini 3 Pro seems to be built for handling multi-page PDFs.

I can feed it a multi-page PDF and tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)
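
Through the API, the whole-PDF flow is roughly this (a minimal sketch assuming the google-generativeai SDK and its File API; the filename and model name are placeholders):

```python
# Sketch: hand Gemini an entire PDF in one request via the File API.
import google.generativeai as genai

genai.configure(api_key="...")  # or set GOOGLE_API_KEY in the environment

pdf = genai.upload_file("contract.pdf")  # the File API accepts PDFs directly
model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name
resp = model.generate_content(
    [pdf, "Convert this document to markdown, transcribing it exactly."]
)
print(resp.text)
```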

coder543 3 hours ago | parent [-]

It's not that they can't do multiple pages... but did you compare against doing one page at a time?

How many pages did you try in a single request? 5? 50? 500?

I fully believe that 5 pages of input works just fine, but that does not scale up to larger documents. The goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task.

If correction is desirable, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.

One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.

staticman2 2 hours ago | parent [-]

I've been doing small PDFs, usually 5 or 6 pages in length.

I never tested Gemini 3's PDF OCR against individual images, but I can say it processes a small 6-page PDF better than the retired Gemini 1.5 or 2.0 handled individual images.

I agree that OCR and analysis should be two separate steps.

HPsquared 4 hours ago | parent | prev [-]

You could maybe then do a second pass on the whole text (as plain text, not OCR) to look for likely mistakes.
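
Something like this, as a minimal sketch (again assuming the OpenAI Python SDK; the model name is a placeholder). Asking the model to flag suspected errors rather than rewrite keeps a human in charge of each change:

```python
# Sketch of a second pass: ask the model to *flag* likely OCR errors in the
# plain text rather than rewrite it, so a human can adjudicate each one.
from openai import OpenAI

client = OpenAI()

def flag_likely_errors(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[
            {"role": "system",
             "content": "You review OCR output. List suspected OCR errors "
                        "as 'line: original -> suggestion'. Do NOT rewrite "
                        "the document or change anything else."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content
```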

kergonath 4 hours ago | parent [-]

This is not always easy. The models I tried were too helpful and rewrote too much instead of fixing simple typos. When I tried, I ended up with huge prompts, and I still found sentences where the LLM had been too enthusiastic. I ended up applying regexes for common typos and accepting some residual errors. It might be better now, though. But since then I’ve moved to all-in-one solutions like Mathpix and Mistral-OCR, which are quite good for my purposes.
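
For illustration, that regex cleanup pass can be as small as a substitution table (the confusion pairs here are hypothetical; a real table would be built from errors actually observed in your own documents):

```python
# Sketch of a regex cleanup pass for common OCR character confusions.
import re

OCR_FIXES = [
    (re.compile(r"(?<=\d)[oO]"), "0"),  # letter o misread for zero after a digit
    (re.compile(r"(?<=\d)[lI]"), "1"),  # l/I misread for one after a digit
    (re.compile(r"[lI](?=\d)"), "1"),   # ...and before a digit
]

def apply_fixes(text: str) -> str:
    for pattern, repl in OCR_FIXES:
        text = pattern.sub(repl, text)
    return text
```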

chrsw 5 hours ago | parent | prev | next [-]

I'm keeping my eye on progress in this area as well. I need to free engineering design data from tens of thousands of PDF pages and make them easily and quickly accessible to LLMs.

aliljet 5 hours ago | parent [-]

All of healthcare is crying. Trust me.

Imustaskforhelp 5 hours ago | parent [-]

I suppose tears of joy?

fragmede 4 hours ago | parent [-]

Of sadness because they're not allowed to use it yet.

daveguy 5 hours ago | parent | prev | next [-]

If your needs are that sensitive, I doubt you'll find anything anytime soon that doesn't require a human in the loop. Even SOTA models only average around 95% accuracy on messy inputs. If that's per-character accuracy (which is how OCR is generally measured), then a page of 100+ words runs to 500+ characters, which means 25+ errors per page. If you really can't afford mistakes, you have to consider the OCR inaccurate. If you have key fields like "days to respond" and "units vacant", you need to detect their presence specifically, with a bias in favor of false positives over false negatives, and have a human confirm the OCR against the source.
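
As a rough sketch of that bias toward false positives (the key phrases and threshold here are hypothetical), you can fuzzy-scan the OCR output so that even a garbled variant of a key field still gets routed to a human:

```python
# Sketch: fuzzy-match key phrases so "1O days to respond" still gets flagged.
from difflib import SequenceMatcher

KEY_PHRASES = ["days to respond", "units vacant", "signature"]

def needs_review(text: str, threshold: float = 0.8) -> list[str]:
    """Return every key phrase that fuzzily appears anywhere in the text."""
    words = text.lower().split()
    hits = []
    for phrase in KEY_PHRASES:
        n = len(phrase.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, window, phrase).ratio() >= threshold:
                hits.append(phrase)
                break
    return hits
```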

kergonath 4 hours ago | parent [-]

> If you really can't afford mistakes, you have to consider the OCR inaccurate.

Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspect.

aliljet 3 hours ago | parent [-]

This is precisely the real question. If you're exceeding human transcription accuracy, you're probably in pretty good shape. The question is what happens when you tell a human to be surgical about some part of the document; how does the comparison change then?

renewiltord 2 hours ago | parent | prev | next [-]

I’m sure you’ve tried all this, but have you tried inter-rater agreement via multiple attempts on the same LLM vs. different LLMs? Perhaps your system would work better if you ran it through 5 models 3 times and then highlighted the diffs for a human chooser.
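
A minimal sketch of that setup (the model names are placeholders, and transcribe() is an assumed helper wrapping your per-page OCR call): run every model/attempt combination and surface only the lines where the transcriptions disagree.

```python
# Sketch: inter-rater agreement across models and attempts via line diffs.
import difflib
from itertools import combinations

MODELS = ["model-a", "model-b", "model-c"]  # hypothetical
ATTEMPTS = 3

def transcribe(page_image, model: str) -> str:
    """Placeholder: wrap your per-page OCR call here."""
    raise NotImplementedError

def disagreements(page_image) -> list[str]:
    runs = {
        f"{m}#{i}": transcribe(page_image, model=m)
        for m in MODELS
        for i in range(ATTEMPTS)
    }
    diffs = []
    for (a, ta), (b, tb) in combinations(runs.items(), 2):
        delta = difflib.unified_diff(
            ta.splitlines(), tb.splitlines(), fromfile=a, tofile=b, lineterm=""
        )
        diffs.extend(delta)
    return diffs  # empty => all runs agreed; otherwise, human review
```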

cinntaile 5 hours ago | parent | prev | next [-]

Deciphering fax messages? What is this, the 90s?

kergonath 4 hours ago | parent | next [-]

We have decades of internal reports on film that we’d like to make accessible and searchable. We don’t do it with new documents, but we have a huge backlog.

xyproto 4 hours ago | parent | prev [-]

Fax is still hard to hack, so some organizations have kept it alive for security.
