westurner 4 days ago

Shell: GNU parallel, pdftotext

Python: PyPDF2, pdfminer.six, GROBID, PyMuPDF; pytesseract (wraps Tesseract, which is C++)
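
A rough sketch of combining the two lists above: Python driving the pdftotext CLI over many files at once, roughly what GNU parallel would do from the shell. The function names and the worker count are made up for illustration; it assumes the pdftotext binary (poppler-utils) is on PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pdftotext_cmd(pdf_path):
    # "-layout" keeps the physical page layout; "-" writes the text to stdout.
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def run_pdftotext(pdf_path):
    # Hypothetical helper: needs the pdftotext binary on PATH.
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout


def extract_all(pdf_paths, extract=run_pdftotext, workers=4):
    # Threads suffice here: the real work happens in child processes.
    paths = list(pdf_paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(extract, paths))
    # Map each input path to its extracted text, in input order.
    return dict(zip((str(p) for p in paths), texts))
```

Something like `extract_all(Path(".").glob("*.pdf"))` would return a dict of path to text. The `extract` argument is swappable, so one of the libraries above could stand in for the CLI call.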

paperetl is built on grobid: https://github.com/neuml/paperetl

annotateai: https://github.com/neuml/annotateai :

> annotateai automatically annotates papers using Large Language Models (LLMs). While LLMs can summarize papers, search papers and build generative text about papers, this project focuses on providing human readers with context as they read.

pdf.js-hypothes.is: https://github.com/hypothesis/pdf.js-hypothes.is :

> This is a copy of Mozilla's PDF.js viewer with Hypothesis annotation tools added

Hypothesis is built on the W3C Web Annotations spec.
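
For a sense of what that spec looks like, here is a minimal sketch of an annotation in the W3C Web Annotation data model, built as JSON-LD from Python. The helper name and the example URL are hypothetical, and this shows the shape of the data model only, not necessarily what the Hypothesis API stores.

```python
import json


def make_annotation(source_url, exact, comment, prefix="", suffix=""):
    # TextQuoteSelector anchors the annotation to quoted text, not page coordinates.
    selector = {"type": "TextQuoteSelector", "exact": exact}
    if prefix:
        selector["prefix"] = prefix
    if suffix:
        selector["suffix"] = suffix
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        "target": {"source": source_url, "selector": selector},
    }


# Hypothetical document URL and quote, for illustration only.
annotation = make_annotation(
    "https://example.org/paper.pdf",
    exact="immutable resources",
    comment="Compare with PDF versioning.",
    prefix="notion of ",
)
annotation_json = json.dumps(annotation, indent=2)
```

Since the selector quotes the text itself, the same annotation can in principle be re-anchored in an HTML rendering of the document as well as in the PDF.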

dokieli implements W3C Web Annotations and many other Linked Data Specs: https://github.com/dokieli/dokieli :

> Implements versioning and has the notion of immutable resources.

> Embedding data blocks, e.g., Turtle, N-Triples, JSON-LD, TriG (Nanopublications).

A dokieli document interface to LLMs would be basically the anti-PDF.

Rust crates: rayon (parallel processing), pdf-rs, tesseract (bindings to the C++ library)

pdf-rs examples/src/bin/extract_page.rs: https://github.com/pdf-rs/pdf/blob/master/examples/src/bin/e...