| ▲ | OpenDataLoader-PDF: An open source tool for structured PDF parsing(github.com) |
| 68 points by phobos44 6 hours ago | 18 comments |
| |
|
| ▲ | emilburzo 3 hours ago | parent | next [-] |
| I just tested it on one of my nemeses: PDF bank statements. They're surprisingly tough to work with if you want to get clean, structured transaction data out of them. The JSON extract actually looks pretty good and seems to produce something usable in one shot, which is very good compared to all the other tools I've tried so far, but I still need to check it more in depth. Sharing here in case someone chimes in with "hey, doofus, $magic_project already solves this." |
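For anyone poking at the same bank-statement problem, here is a minimal sketch of flattening a table-bearing JSON extract into transaction rows. The `type`/`kids`/`rows`/`cells` field names are assumptions for illustration, not OpenDataLoader-PDF's documented schema:

```python
# Sketch: flatten a hypothetical structured-PDF JSON extract into
# transaction rows. The "type"/"kids"/"rows"/"cells" field names are
# assumptions for illustration, not the tool's documented output schema.
import json

def iter_tables(node):
    """Recursively yield nodes tagged as tables."""
    if node.get("type") == "table":
        yield node
    for child in node.get("kids", []):
        yield from iter_tables(child)

with open("statement.json") as f:
    doc = json.load(f)

transactions = []
for table in iter_tables(doc):
    for row in table.get("rows", []):
        cells = [c.get("text", "").strip() for c in row.get("cells", [])]
        if len(cells) >= 3:  # e.g. date, description, amount
            transactions.append(cells)

for t in transactions:
    print(t)
```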
| |
|
| ▲ | fedeb95 3 hours ago | parent | prev | next [-] |
| Very cool. I'll probably use it, but not for AI. I have lots of PDFs for which an epub doesn't exist. Or if anything I'll add it to the projects-that-already-do-this-but-haven't-yet-found list. |
|
| ▲ | trevor-e 4 hours ago | parent | prev | next [-] |
| I've been thinking lately that maybe we need a new AI-friendly file format rather than continuing to hack on top of PDF's complicated spec. PDF was designed for consistent, portable page rendering; being easily parseable was not a goal afaik, which is why we have to jump through these crazy hoops. If you've ever looked at how text is stored internally in PDF this becomes immediately obvious. I've been toying with the idea of a new format that stores text naturally and captures semantics (e.g. to help with table parsing), but also preserves formatting rules so you can still achieve fairly consistent rendering. This format could be easily converted to PDF, although the opposite conversion would have the usual challenges. The main challenge, of course, is distribution. |
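Not the commenter's actual design, but one way such a format could keep text and semantics as the source of truth and demote formatting to an advisory layer, sketched with Python dataclasses:

```python
# Illustrative only: one possible shape for a format that stores text and
# semantics first, with presentation as a separate, optional layer.
# (Python 3.10+ for the "str | None" union syntax.)
from dataclasses import dataclass, field

@dataclass
class Style:                      # presentation layer, advisory only
    font: str = "serif"
    size_pt: float = 11.0

@dataclass
class Block:                      # semantic layer, the source of truth
    kind: str                     # "paragraph", "heading", "table-cell", ...
    text: str
    role: str | None = None       # e.g. "column-header" to aid table parsing
    style: Style | None = None    # renderers may honor or ignore this

@dataclass
class Document:
    blocks: list[Block] = field(default_factory=list)

doc = Document(blocks=[
    Block(kind="heading", text="Q3 results"),
    Block(kind="table-cell", text="Revenue", role="column-header"),
])
```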
| |
| ▲ | s0rce 7 minutes ago | parent | next [-] |
| Doesn't LaTeX do this? |
| ▲ | Jaxan 3 hours ago | parent | prev [-] |
| Wouldn't it be better to invest in a human-friendly format first (which could also be AI-friendly)? |
| ▲ | dotancohen an hour ago | parent | next [-] |
| If you can convince your bank to make your statement available in Markdown, let us know. Your transactions are probably already available in CSV. |
| ▲ | trevor-e 3 hours ago | parent | prev [-] |
| Not really sure what you mean by a "human-friendly" file format, can you elaborate? File formats are inherently not friendly to humans; they are a bag of bytes. But that doesn't mean they can't be better consumed by tools, which is what I mean by "AI-friendly". |
|
|
|
| ▲ | clueless 5 hours ago | parent | prev | next [-] |
| Given current LLM context-size limitations, what is the state of the art for feeding large doc/text blobs into an LLM for accurate processing? |
| |
| ▲ | simonw 4 hours ago | parent | next [-] |
| The current generation of models all support pretty long context now - the Gemini family has had 1m tokens for over a year, GPT-4.1 is 1m, interestingly GPT-5 is back down to 400,000, and Claude 4 is 200,000, though there's a mode of Claude Sonnet 4 that can do 1m as well. The bigger question is how well they perform - there are needle-in-a-haystack benchmarks that test that, and they're mostly scoring quite highly on those now. https://cloud.google.com/blog/products/ai-machine-learning/t... talks about that for Gemini 1.5. Here are a couple of relevant leaderboards: https://huggingface.co/spaces/RMT-team/babilong and https://longbench2.github.io/ |
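When a document still overflows the window, the usual fallback is map-reduce-style chunking. A minimal sketch; `summarize()` is a placeholder for whatever model call you use, and the chunk size is an arbitrary assumption:

```python
# Sketch of map-reduce summarization for a document that exceeds the
# context window. summarize() stands in for your LLM call of choice;
# the 100k-character chunk size is an arbitrary placeholder.
def summarize(text: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def map_reduce(text: str, chunk_chars: int = 100_000) -> str:
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [summarize(c) for c in chunks]   # map step: one call per chunk
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))       # reduce step: combine summaries
```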
| ▲ | clueless 4 hours ago | parent [-] |
| Sorry, I should have been more clear: I meant open-source LLMs. And I guess the question is, how are closed-source LLMs doing it so well? And if OS OpenNote is the best we have... |
| |
| ▲ | lysecret 4 hours ago | parent | prev [-] |
| Generally I use 2.5 Flash for this, works incredibly well. So many traditionally hard things can now be solved by stuffing them into a pretty cheap LLM haha. |
| ▲ | mekael 3 hours ago | parent [-] |
| What do you mean by "traditionally hard" in relation to a PDF? Most if not all of the docs I'm tasked with parsing are secured, flattened, and handwritten, which can cause any tool (traditional or AI) to require a confidence score and manual intervention. Also might be that I just get stuck with the edge cases 90% of the time. |
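A sketch of the confidence-score-plus-manual-review loop described above, using pytesseract's word-level confidence output; the 60-point threshold is an arbitrary assumption to tune per document class:

```python
# Sketch: route low-confidence OCR pages to manual review. The 60-point
# threshold is an arbitrary assumption, not a recommended value.
import pytesseract
from pytesseract import Output
from PIL import Image

def page_confidence(image_path: str) -> float:
    """Mean word-level OCR confidence for one page image."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 = no text box
    return sum(confs) / len(confs) if confs else 0.0

for page in ["page1.png", "page2.png"]:
    conf = page_confidence(page)
    status = "ok" if conf >= 60 else "needs manual review"
    print(f"{page}: mean confidence {conf:.0f} -> {status}")
```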
|
|
|
| ▲ | agsqwe 3 hours ago | parent | prev | next [-] |
| How does it compare to docling? |
| |
| ▲ | favorited 33 minutes ago | parent [-] |
| Docling primarily uses AI models to extract PDF content, while this project looks like it uses a custom parser written in Java, built atop veraPDF. |
| ▲ | brumar 25 minutes ago | parent [-] |
| Correct me if I am wrong, but Docling can do both. It also has, among other strategies, a non-AI pipeline to determine the layout (based on qpdf, I believe). So these projects are not that different. |
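For anyone comparing the two, Docling's basic conversion path looks roughly like this; a sketch based on its documented `DocumentConverter` API, worth verifying against the current docs:

```python
# Sketch of Docling's basic usage for comparison; based on its documented
# DocumentConverter API, but check the current docs before relying on it.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("statement.pdf")
print(result.document.export_to_markdown())  # or export_to_dict() for JSON
```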
|
|
|
| ▲ | constantinum 2 hours ago | parent | prev [-] |
| There is also the open-source Unstract: structured data extraction + ETL. https://github.com/Zipstack/unstract |