coder543 3 hours ago
It's not that they can't handle multiple pages, but did you compare against doing one page at a time? How many pages did you try in a single request? 5? 50? 500?

I fully believe that 5 pages of input works just fine, but that does not scale to larger documents. The goal of OCR is usually to know what is actually written on the page, not what "should" have been written on the page. I think a larger page count makes the LLM more likely to hallucinate as it tries to "correct" errors that it sees, which is not the task. If correction is what you want, I think it would be better to post-process the document with an LLM after it has been converted to text, rather than asking the LLM to read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs get lazy and stop providing complete OCR for every page in their response. One page at a time keeps the LLM focused on the task, and it's easy to parallelize, so entire documents can be OCR'd quickly.
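To make the one-page-at-a-time idea concrete, here is a minimal sketch of how the parallel part could look. It assumes the document has already been split into per-page images, and `ocr_page` is a hypothetical callable standing in for whatever single-image request you send to the model. It is not any real library's API, and `fake_ocr_page` exists only so the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_document(
    pages: Sequence[bytes],
    ocr_page: Callable[[bytes], str],
    max_workers: int = 8,
) -> List[str]:
    """OCR each page in its own request and return the text in page order.

    `ocr_page` is a hypothetical function (an assumption of this sketch)
    that sends exactly one page image to the model and returns its text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the i-th result
        # belongs to the i-th page even if requests finish out of order.
        return list(pool.map(ocr_page, pages))


def fake_ocr_page(image: bytes) -> str:
    # Stand-in for a real model call; it only decodes the bytes.
    return image.decode("utf-8")


# Three pretend "page images" whose bytes are just their own text.
pages = [f"page {i}".encode("utf-8") for i in range(1, 4)]
texts = ocr_document(pages, fake_ocr_page)
```

Threads are a reasonable fit because each request mostly waits on the network. A real version would probably also want retries and rate limiting, which this leaves out.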
staticman2 2 hours ago
I've been doing small PDFs, usually 5 or 6 pages long. I never tested Gemini 3 PDF OCR against individual images, but I can say it processes a small 6-page PDF better than the retired Gemini 1.5 or 2 did with individual images. I agree that OCR and analysis should be two separate steps.