dredmorbius 8 hours ago

Shortcutting much of the discussion here (what are your goals / why would you do that / don't use format X): a key problem is that neither HTML (as published on today's Web) nor PDF is reliable as a canonical document format. Tagged markup such as Markdown (or other lightweight markup languages) or LaTeX (or other heavy markup languages) is far more robust. Markdown has its variants, but all are pretty simple and easy to produce. LaTeX is slightly more complex, but remains quite straightforward for simple works.

Once you've got an appropriate canonical version in any of these options, you have an embarrassment of riches for converting to any given document format (what I call endpoints) you'd care for: PDF, HTML, RTF, DOCX, or many, many others. I generally reach for Pandoc first, which itself, yes, of course, often relies on additional tools/libraries to parse or generate endpoints, but is quite versatile.
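To make the one-source-many-endpoints idea concrete, here's a rough sketch that drives Pandoc from Python. The helper names (`pandoc_command`, `convert_all`) and the `ENDPOINTS` table are made up for illustration; the only Pandoc options used are `--standalone` and `-o`, and Pandoc is left to infer formats from the file extensions. It assumes `pandoc` is on your PATH, and PDF output additionally assumes a PDF engine (typically a LaTeX install) is available.

```python
import shutil
import subprocess
from pathlib import Path

# Endpoint formats named above, mapped to file extensions (illustrative table).
ENDPOINTS = {"pdf": ".pdf", "html": ".html", "rtf": ".rtf", "docx": ".docx"}


def pandoc_command(source: Path, endpoint: str) -> list[str]:
    """Build the pandoc invocation for one endpoint (hypothetical helper)."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unsupported endpoint: {endpoint}")
    target = source.with_suffix(ENDPOINTS[endpoint])
    # --standalone asks for a complete document rather than a fragment;
    # pandoc infers input and output formats from the file extensions.
    return ["pandoc", "--standalone", str(source), "-o", str(target)]


def convert_all(source: Path) -> None:
    """Run pandoc once per endpoint, stopping on the first failure."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    for endpoint in ENDPOINTS:
        subprocess.run(pandoc_command(source, endpoint), check=True)
```

Calling `convert_all(Path("notes.md"))` would then write `notes.pdf`, `notes.html`, `notes.rtf`, and `notes.docx` next to the source, with the Markdown file remaining the canonical version.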

You can simplify the intake of HTML by stripping out cruft. Readability, Beautiful Soup, or other HTML filtering tools can target the core content and metadata you most likely want.
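As a rough illustration of what "stripping out cruft" means, here's a minimal sketch using only Python's standard-library `html.parser` rather than Readability or Beautiful Soup, which do this far more thoroughly. The `SKIP` set, the `TextExtractor` class, and the `extract` function are my own assumptions about what counts as cruft, not anything those tools define.

```python
from html.parser import HTMLParser

# Elements whose contents are treated as cruft (an assumption; tune per site).
SKIP = {"script", "style", "nav", "header", "footer", "aside", "form", "noscript"}


class TextExtractor(HTMLParser):
    """Collect the page title and any text outside the skipped elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # how many skipped elements we are inside
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP and self.depth > 0:
            self.depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.depth == 0:
            self.chunks.append(text)


def extract(html):
    """Return (title, body_text) with skipped elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

This is deliberately naive: it trusts the markup to close its tags and makes no attempt to score which block is the "main" content, which is exactly the part Readability-style tools add.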

Otherwise, think through what you're doing and why to more narrowly define your goals and tools. E.g., if you want a faithful printed representation of a mainstream-browser-rendered page (that is, Google Chrome), you'd probably do best to use its print-to-PDF options (mentioned several times here). If you want to extract core text, filtering out much of today's WWW cruft will be a high priority.