staticman2 3 hours ago
Gemini Pro 3 seems to be built for handling multi-page PDFs. I can feed it a multi-page PDF, tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)
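If anyone wants to reproduce this through the API, here is a rough sketch of the approach (google-genai SDK; the model ID and prompt are placeholders, not exactly what I used):

    # Rough sketch: send a whole multi-page PDF to Gemini and ask for markdown.
    # Assumes the google-genai SDK; the model ID and prompt are placeholders.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment

    with open("document.pdf", "rb") as f:
        pdf_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-2.5-pro",  # swap in whichever Gemini model you're testing
        contents=[
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            "Convert this PDF to markdown. Preserve headings and tables.",
        ],
    )
    print(response.text)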
coder543 3 hours ago
It's not that they can't do multiple pages... but did you compare against doing one page at a time? How many pages did you try in a single request? 5? 50? 500? I fully believe that 5 pages of input works just fine, but that does not scale up to larger documents.

The goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that correction is desirable, it would be better to post-process the document with an LLM after it has been converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.

Once the document gets long enough, current LLMs also get lazy and stop providing complete OCR for every page in their response. One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.
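For comparison, here is roughly what I mean by one page at a time: a sketch that uses pypdf to split the document and the google-genai SDK to OCR each page in parallel (the model ID and prompt are placeholders; real code would want retries and rate limiting):

    # Rough sketch: OCR a PDF one page per request, in parallel.
    # Assumes pypdf and the google-genai SDK; the model ID and prompt are
    # placeholders, and real code would want retries and rate limiting.
    import io
    from concurrent.futures import ThreadPoolExecutor

    from google import genai
    from google.genai import types
    from pypdf import PdfReader, PdfWriter

    client = genai.Client()  # picks up the API key from the environment

    def single_page_pdf(reader: PdfReader, index: int) -> bytes:
        # Re-wrap one page as a standalone PDF so it can be sent on its own.
        writer = PdfWriter()
        writer.add_page(reader.pages[index])
        buf = io.BytesIO()
        writer.write(buf)
        return buf.getvalue()

    def ocr_page(page_bytes: bytes) -> str:
        response = client.models.generate_content(
            model="gemini-2.5-pro",  # swap in whichever model you're testing
            contents=[
                types.Part.from_bytes(data=page_bytes, mime_type="application/pdf"),
                "Transcribe this page as markdown, exactly as written. Do not correct apparent errors.",
            ],
        )
        return response.text

    reader = PdfReader("document.pdf")
    pages = [single_page_pdf(reader, i) for i in range(len(reader.pages))]

    with ThreadPoolExecutor(max_workers=8) as pool:
        markdown_pages = list(pool.map(ocr_page, pages))

    print("\n\n".join(markdown_pages))

Each request stays small, a failed page can be retried on its own, and pool.map keeps the results in page order so the output stitches back together cleanly.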
| ||||||||