david_draco 4 days ago
Looking at the code, this converts PDF pages to images, then transcribes each image. I might have expected a pdftotext post-processor. The complexity of PDF, I guess ...
firesteelrain 4 days ago
There is a very popular Python module called ocrmypdf. I used it to help my HOA with OCR'ing old PDFs. https://github.com/ocrmypdf/OCRmyPDF No LLMs required.
westurner 4 days ago
Shell: GNU parallel, pdftotext

Python: PyPDF2, pdfminer.six, GROBID, PyMuPDF; pytesseract (a wrapper around Tesseract, which is C++)

paperetl is built on GROBID: https://github.com/neuml/paperetl

annotateai: https://github.com/neuml/annotateai :

> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.

pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is :

> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added

Hypothesis is built on the W3C Web Annotations spec.

dokieli implements W3C Web Annotations and many other Linked Data specs: https://github.com/dokieli/dokieli :

> Implements versioning and has the notion of immutable resources.

> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).

A dokieli document interface to LLMs would be basically the anti-PDF.

Rust crates: rayon (handles parallel processing), pdf-rs, tesseract (bindings to the C++ library)

pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...
moritonal 4 days ago
I imagine part of the issue is how many PDFs are just a series of images anyway.
enjaydee 4 days ago
Saw this tweet the other day that helped me understand just how crazy PDF parsing can be.
ethan_smith 3 days ago
Image-based extraction often preserves layout better than direct text extraction, and it copes with PDFs that have embedded fonts, scanned content, or security restrictions.
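
For anyone weighing the two approaches in this thread, here is a rough sketch of a hybrid: try the text layer first with pdftotext, and only flag pages for image-based OCR when almost nothing comes back. It assumes poppler's `pdftotext` is on the PATH; the helper names and the 20-character threshold are made up for illustration and are not taken from the project under discussion. It relies on pdftotext ending each page with a form feed character.

```python
import subprocess


def extract_text_layer(pdf_path: str) -> str:
    """Return the text layer of a PDF via poppler's pdftotext (must be installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_pages(raw: str) -> list[str]:
    """Split pdftotext output into pages; each page ends with a form feed."""
    pages = raw.split("\f")
    # The form feed after the last page leaves one empty trailing chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page with almost no text is probably a scan."""
    visible = "".join(page_text.split())  # drop all whitespace
    return len(visible) < min_chars


def pages_needing_ocr(raw: str, min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose text layer looks empty."""
    return [
        number
        for number, text in enumerate(split_pages(raw), start=1)
        if needs_ocr(text, min_chars)
    ]
```

Usage would be something like `pages_needing_ocr(extract_text_layer("scan.pdf"))`, then rendering only those pages to images for OCR or an LLM. The heuristic will miss pages whose text layer is present but garbage (bad font encodings), which is one reason to go image-only as the project does.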