firesteelrain 4 days ago

There is a very popular Python module called ocrmypdf. I used it to help my HOA OCR its old PDFs.

https://github.com/ocrmypdf/OCRmyPDF

No LLMs required.
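
For anyone wanting to try this on a folder of scans, here is a minimal sketch, not the exact setup used above. It assumes OCRmyPDF and Tesseract are installed; `ocr_folder` and its `ocr` parameter are hypothetical names made up for illustration, while `ocrmypdf.ocr(..., skip_text=True)` is the library's Python entry point.

```python
from pathlib import Path


def ocr_folder(src_dir, dst_dir, ocr=None):
    """OCR every PDF in src_dir into dst_dir and return the output paths.

    `ocr` is a hypothetical hook: any callable taking (input_path, output_path).
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without OCRmyPDF installed.
        import ocrmypdf

        def ocr(src, dst):
            # skip_text=True leaves pages that already have a text layer alone.
            ocrmypdf.ocr(src, dst, skip_text=True)

    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        out = dst_dir / pdf.name
        ocr(pdf, out)
        written.append(out)
    return written
```

Something like `ocr_folder("scans", "searchable")` would then write searchable copies next to the originals without overwriting them.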

dreamcompiler 2 days ago | parent | next [-]

20 years ago I tried in vain to get my HOA to use the virtual printer for PDF documents so they'd be searchable. The capability was built in to both Mac and Windows even way back then.

No luck. They just could not grasp it. So they kept using their process of printing out the file on paper and then scanning it back in as a PDF image file.

I finally quit trying. Now of course they've seen the light and are painstakingly OCRing all that old stuff.

firesteelrain 2 days ago | parent [-]

Ouch! I am on the BOD, so as an IT/engineering professional I can influence things better.

cess11 3 days ago | parent | prev [-]

It's nice. I've used it as a fallback text extraction method in an ETL flow that chugged through tens of thousands of corporate and legal PDF files.
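
A rough sketch of what such a fallback could look like, not the actual ETL code. The names `extract_with_fallback`, `primary`, `fallback`, `min_chars` and `ocr_text` are all hypothetical, and the 20-character threshold is an arbitrary assumption. The OCR step assumes OCRmyPDF's `sidecar` and `force_ocr` options, which write the recognized text to a plain-text file.

```python
def extract_with_fallback(pdf_path, primary, fallback, min_chars=20):
    """Try the cheap extractor first; use the fallback if it yields too little text.

    Returns (text, source) where source is "primary" or "fallback".
    """
    try:
        text = primary(pdf_path) or ""
    except Exception:
        # A broken or image-only PDF should not stop the whole batch.
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "primary"
    return fallback(pdf_path), "fallback"


def ocr_text(pdf_path):
    """Hypothetical fallback: OCR the file and return the sidecar text."""
    import tempfile
    from pathlib import Path

    import ocrmypdf  # imported lazily; only needed when the fallback runs

    with tempfile.TemporaryDirectory() as tmp:
        sidecar = Path(tmp) / "text.txt"
        # force_ocr=True rasterizes and OCRs every page, even ones with a text layer.
        ocrmypdf.ocr(pdf_path, Path(tmp) / "ocr.pdf", sidecar=sidecar, force_ocr=True)
        return sidecar.read_text(encoding="utf-8")
```

The primary extractor is left as a plain callable so any PDF text library can be plugged in; OCR only runs for files where that extractor comes back nearly empty, which keeps the slow path rare.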