Show HN: Doc2dict a fast, open-source document to dict converter – No AI
3 points by jgfriedman1999 6 hours ago

doc2dict is a Python package that converts html and pdf documents into dictionaries that preserve the document's hierarchy. It also supports table extraction for html files. https://github.com/john-friedman/doc2dict

Speed:

* html - 500 pages per second single threaded.

* pdf - 200 pages per second. The pdf must have an underlying text structure. Multithreading is not possible due to the limitations of PDFium.

Here's an example output from Microsoft's Annual Report:

  "title": "PART I",
  "standardized_title": "parti",
  "class": "part",
  "contents": {
    "38": {
      "title": "ITEM 1. BUSINESS",
      "standardized_title": "item1",
      "class": "item",
      "contents": {
        "39": {
          "title": "GENERAL",
          "standardized_title": "",
          "class": "predicted header",
          "contents": {
            "40": {
              "title": "Embracing Our Future",
              "standardized_title": "",
              "class": "predicted header",
              "contents": {
                "41": {
                  "text": "Microsoft is a technolo...

Raw: https://html-preview.github.io/?url=https://raw.githubuserco...

Parsed dictionary: https://github.com/john-friedman/doc2dict/blob/main/example_...

Simple description of algorithm:

* Take a complicated document such as a pdf or html file and create a simplified representation of it: a list of lists of dicts, where each dict is a text block with key features such as "bold", "font-size", etc., and each inner list represents a new html block or a line in a pdf.

* Convert the simplified representation to a dictionary using a set of predetermined rules, e.g. a heading with a smaller font-size is nested under the preceding heading with a larger font-size.

Note that I am working on making the second step more modular by creating predetermined instructions that users can tweak for their use-case without rewriting the parser. I call these "mapping dicts".

doc2dict also includes visualization tools for the debugging process:

* visualize simplified representation https://html-preview.github.io/?url=https://github.com/john-...

* visualize output dictionary https://html-preview.github.io/?url=https://github.com/john-...

Why I made this: I'm currently working on another open source Python package to make it easy to work with Securities and Exchange Commission data. Writing a generalized document parser that can be tweaked is easier than writing 100 or so specialized parsers, one for each document type.

Also, converting html and pdf files to dictionary representation reduces document size by a factor of 10 or so. Not sure what I can do with that, but planning on some fun NoSQL database experiments.

Link to other package (datamule) https://github.com/john-friedman/datamule-python