crazylogger 6 days ago
https://pure.md is exactly what you're looking for. But stripping complex formats like HTML and PDF down to simple markdown is a hard problem: it's nearly impossible to infer what the rendered page looks like from the raw HTML or PDF source alone. https://github.com/mozilla/readability helps, but it often breaks down on unconventional div structures. I've heard the state-of-the-art solution is to use a multimodal LLM as OCR, so that it actually looks at the rendered page and rewrites it in markdown. Which makes me wonder: how did OpenAI make their models read PDF, DOCX and images at all?
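To make the "raw HTML doesn't tell you what the page looks like" point concrete, here is a minimal sketch of a purely tag-based converter using only Python's standard `html.parser`. It is not how pure.md or readability work; the class name `NaiveMarkdown`, the helper `to_markdown` and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class NaiveMarkdown(HTMLParser):
    """Maps a few semantic tags to markdown; knows nothing about CSS or layout."""

    PREFIX = {"h1": "# ", "h2": "## ", "li": "- "}
    BLOCK = {"h1", "h2", "p", "li", "div"}

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output blocks
        self.prefix = ""   # markdown prefix for the block being built
        self.current = ""  # raw text collected for the block being built

    def _flush(self):
        # Collapse whitespace and emit the block if it contains any text.
        text = " ".join(self.current.split())
        if text:
            self.lines.append(self.prefix + text)
        self.prefix = ""
        self.current = ""

    def handle_starttag(self, tag, attrs):
        # Note: attrs (including style="...") are ignored entirely.
        if tag in self.BLOCK:
            self._flush()
            self.prefix = self.PREFIX.get(tag, "")

    def handle_endtag(self, tag):
        if tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        self.current += data


def to_markdown(html: str) -> str:
    parser = NaiveMarkdown()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.lines)


# Hypothetical page: one real heading, one div styled to look like a heading,
# and one div that a browser would never display.
SAMPLE = (
    "<h1>Docs</h1>"
    '<div style="font-size:2em">Looks like a heading</div>'
    '<div style="display:none">Hidden banner</div>'
    "<p>Body   text</p>"
)

if __name__ == "__main__":
    print(to_markdown(SAMPLE))
```

Only the real `<h1>` becomes a markdown heading. The div that is styled to look like a heading comes out as an ordinary paragraph, and the `display:none` banner ends up in the output even though no reader would ever see it. Fixing either case means evaluating the CSS, which is roughly why looking at the rendered page is attractive.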