ritvikpandey21 4 days ago

we disagree! we've found llms by themselves aren't enough: they suffer from pretty big failure modes like hallucinating and inferring text rather than purely transcribing it. we wrote a blog post about this [1]. the right approach so far seems to be a hybrid workflow that uses very specific parts of the language model architecture.

[1] https://www.runpulse.com/blog/why-llms-suck-at-ocr
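one concrete flavor of that hybrid idea (a minimal sketch, not Pulse's actual pipeline): cross-check the LLM's transcript against a deterministic OCR pass, and flag spans where the two disagree too much as possible hallucinations. everything below, including the threshold, is an illustrative assumption:

```python
from difflib import SequenceMatcher

def flag_possible_hallucination(ocr_text: str, llm_text: str,
                                threshold: float = 0.85) -> bool:
    """Return True when the LLM transcript diverges too far from the raw OCR pass.

    The 0.85 similarity threshold is a made-up illustrative value; a real
    system would tune it per document type.
    """
    ratio = SequenceMatcher(None, ocr_text, llm_text).ratio()
    return ratio < threshold

# Example: the LLM "helpfully" invents a note that isn't in the scan.
ocr = "Invoice 1042  Total: 318.00"
llm = "Invoice 1042  Total: 318.00 (paid in full)"
print(flag_possible_hallucination(ocr, llm))
```

the point of the deterministic pass isn't that it's more accurate overall, it's that it never invents text, so it makes a usable ground truth for catching inference.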

mritchie712 4 days ago | parent | next [-]

> Why LLMs Suck at OCR

I paste screenshots into claude code every day and it's incredible. As in, I can't believe how good it is. I send a screenshot of console logs, a UI, and some HTML elements, and it just "gets it".

So saying they "Suck" makes me not take your opinion seriously.

ritvikpandey21 4 days ago | parent | next [-]

yeah, models are definitely improving, but we've found even the latest ones still hallucinate and infer text rather than doing pure transcription. we run rigorous benchmarks against all of the frontier models. we think the differentiation is accuracy on truly messy docs (nested tables, degraded scans, handwriting) and the ability to deploy on-prem/vpc for regulated industries.

mikert89 4 days ago | parent | prev [-]

they need to convince customers it's what they need

serjester 4 days ago | parent | prev | next [-]

This is a hand-wavy article that dismisses VLMs without acknowledging the real-world performance everyone is seeing. I think it'd be far more useful if you published an eval.

mikert89 4 days ago | parent | prev [-]

one or two more model releases, and raw documents passed to claude will beat whatever prompt voodoo you guys are cooking

holler 4 days ago | parent [-]

Having worked in this space, I have real doubts about that. Right now Claude and other top models already do a decent job at, e.g., "generate OCR from this document". But as mentioned, there are serious failure modes: it's non-deterministic, and it's especially cost-prohibitive at scale.