▲ | alsetmusic 13 hours ago | |
Just two weeks ago, I was trying to save an online-only book of 24 chapters. The text is filled with images that illustrate and contextualize the content. I saved each chapter individually as a PDF and ran a few different command-line tools to try to extract the contents to plain text. The output was badly disjointed every time. Even tools meant to do what this paper describes failed miserably at reconstructing naturally flowing text. While this isn't something I need on a regular basis, it's timely to hear about someone making progress on what seems like it ought to be a straightforward problem. As the results of my efforts show, it must not be nearly as simple as one might expect. | ||
▲ | jimmySixDOF 9 hours ago | |
Docling from IBM and Markitdown from Microsoft are reasonably reliable, if you haven't tried them already. Also take the extra step of getting plain-text image summaries from a VLM; that's useful if you want to feed the final results to an LLM later. Or first try to skip all that with jina.reader or firecrawl llmstxt: they extract directly from the website, which is simple, but sometimes it works and sometimes it doesn't. | ||
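The ordering suggested here (try direct extraction from the website first, then fall back to a PDF converter) could be sketched roughly as below. This is only an illustration: `first_usable`, `web_reader`, `pdf_converter`, and the `min_chars` threshold are hypothetical placeholders, not the actual jina.reader, firecrawl, Docling, or Markitdown APIs.

```python
def first_usable(source, extractors, min_chars=200):
    """Try each (name, extract) pair in order; return the first usable result."""
    for name, extract in extractors:
        try:
            text = extract(source)
        except Exception:
            continue  # this tool failed on the input; move on to the next one
        # Treat very short output as a failed extraction (assumed heuristic).
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Hypothetical stand-ins for real tools, used only to exercise the fallback.
def web_reader(source):
    raise RuntimeError("site could not be read")


def pdf_converter(source):
    return "Chapter 1\n" + "Some extracted paragraph text. " * 20


name, text = first_usable(
    "chapter-01", [("web", web_reader), ("pdf", pdf_converter)]
)
print(name)
```

In practice each placeholder would wrap a real tool call, and the length check would likely be replaced by something stricter, such as spot-checking that paragraphs read in order.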
▲ | Hnrobert42 11 hours ago | |
You could try splitting the book into one-page PDFs, sending each page to Gemini flash 2.5, and asking it to OCR the page to markdown. It costs about $0.006 (USD) per page. It works well for one of my clients. |
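For a rough sense of what that per-page figure adds up to, here is a minimal cost estimate. The $0.006 rate is the number quoted in this comment (actual pricing may differ), and the page counts are a made-up assumption, since the thread gives only the chapter count.

```python
from decimal import Decimal

# Per-page figure quoted in the comment; treat it as approximate.
COST_PER_PAGE_USD = Decimal("0.006")


def estimate_ocr_cost(pages_per_chapter):
    """Return (total_pages, estimated_cost_usd) for one request per page."""
    total_pages = sum(pages_per_chapter)
    return total_pages, total_pages * COST_PER_PAGE_USD


# Hypothetical book: 24 chapters of 20 pages each.
pages, cost = estimate_ocr_cost([20] * 24)
print(f"{pages} pages -> about ${cost:.2f}")
```

`Decimal` is used instead of floats so that small per-page amounts sum without rounding noise.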