The reverse of this question: what is the best way to convert PDF to HTML? We are required by accessibility law to make our PDFs WCAG compliant, but it would be easier to convert them to HTML.

dredmorbius 8 hours ago

This reduces to parsing PDFs, which is a hard, unsolved problem.

At low volumes, my preferred approach is to select and extract the text (copy/paste, or the poppler library for larger-scale work), dump it to plain text, and convert that (manually or with a script) to Markdown. From there you can get to HTML, PDF, or pretty much any other format with tools such as pandoc.
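A minimal sketch of that pipeline in Python, assuming poppler's `pdftotext` command-line tool and `pandoc` are both installed and on the PATH. The function names and the cleanup heuristics in `text_to_markdown` (all-caps single lines become headings, hyphenated line breaks are rejoined) are hypothetical illustrations, not anything the commenter described, and the resulting HTML would still need a manual WCAG review.

```python
from __future__ import annotations

import re
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: Path, txt: Path) -> list[str]:
    # pdftotext ships with poppler; -layout keeps the physical text layout.
    return ["pdftotext", "-layout", str(pdf), str(txt)]


def pandoc_cmd(md: Path, html: Path) -> list[str]:
    # Convert Markdown to a standalone HTML document.
    return ["pandoc", "--from", "markdown", "--to", "html",
            "--standalone", "--output", str(html), str(md)]


def text_to_markdown(text: str) -> str:
    """Hypothetical cleanup pass: rough plain text -> rough Markdown.

    Assumptions: blank lines (and form feeds, which pdftotext emits
    between pages) separate blocks, and a short single-line block in
    all caps is a heading. Real documents will need more rules.
    """
    blocks = re.split(r"\n\s*\n", text.replace("\f", "\n\n").strip())
    out = []
    for block in blocks:
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if len(lines) == 1 and lines[0].isupper() and len(lines[0]) < 80:
            out.append("## " + lines[0].title())
            continue
        para = lines[0]
        for ln in lines[1:]:
            if para.endswith("-"):
                # Rejoin a word split across lines. This also merges
                # genuine hyphenated compounds, so review the result.
                para = para[:-1] + ln
            else:
                para += " " + ln
        out.append(para)
    return "\n\n".join(out) + "\n"


def convert(pdf: Path, out_dir: Path) -> Path:
    """Run the three steps: PDF -> text -> Markdown -> HTML."""
    out_dir.mkdir(parents=True, exist_ok=True)
    txt = out_dir / (pdf.stem + ".txt")
    md = out_dir / (pdf.stem + ".md")
    html = out_dir / (pdf.stem + ".html")
    subprocess.run(pdftotext_cmd(pdf, txt), check=True)
    md.write_text(text_to_markdown(txt.read_text(encoding="utf-8")),
                  encoding="utf-8")
    subprocess.run(pandoc_cmd(md, html), check=True)
    return html
```

Calling `convert(Path("report.pdf"), Path("out"))` would leave the intermediate `.txt` and `.md` files next to the HTML, so the Markdown can be corrected by hand and pandoc rerun without re-extracting the text.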

fschuett 14 hours ago

Render to SVG; at least, that's what I did for https://fschutt.github.io/printpdf/

I am currently writing a WASM-ready PDF toolkit that can handle both HTML-to-PDF conversion and rendering PDF pages to SVG. However, it's not yet production-ready.

The underlying HTML engine is still very much a work in progress, but it gives me the low-level access that I need: https://azul.rs/reftest

bencornia a day ago

I have been using pdf2htmlEX with some success: https://github.com/pdf2htmlEX/pdf2htmlEX

	drabbiticus a day ago

	This is really cool, so thanks for sharing. Since the motivating goal of the question you are answering is WCAG compliance, is the output of pdf2htmlEX meaningfully more WCAG compliant?