ggnore7452 3 days ago

I’ve done a similar PDF → Markdown workflow.

For each page:

- Extract text as usual.

- Capture the whole page as an image (~200 DPI).

- Optionally extract images/graphs within the page and include them in the same LLM call.

- Optionally add a bit of context from neighboring pages.

Then wrap everything with a clear prompt (structured output + how you want graphs handled), and you’re set.
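A minimal per-page sketch of the steps above, assuming PyMuPDF for text/image extraction and the OpenAI chat completions API with a vision-capable model; the exact model id and the PROMPT text are placeholders, not anything from the original comment:

```python
import base64
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()
PROMPT = (
    "Convert this PDF page to clean Markdown. Use the extracted text for "
    "accuracy and the page image for layout. Render graphs as tables or "
    "short bullet summaries."
)

def page_to_markdown(doc, page_index: int) -> str:
    page = doc[page_index]
    raw_text = page.get_text()                     # plain text extraction
    png = page.get_pixmap(dpi=200).tobytes("png")  # whole page as an image, ~200 DPI
    image_b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-5-mini",  # assumed model id; Gemini 2.5 Flash works via its own SDK
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{PROMPT}\n\nExtracted text:\n{raw_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

doc = fitz.open("input.pdf")
markdown = "\n\n".join(page_to_markdown(doc, i) for i in range(len(doc)))
```

Neighboring-page context or extracted figures can simply be appended as extra text/image parts in the same message.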

At this point, models like GPT-5-nano/mini or Gemini 2.5 Flash are cheap and strong enough to make this practical.

Yeah, it’s a bit like using a rocket launcher on a mosquito, but this is actually very easy to implement and quite flexible and powerful. It works across almost any format, Markdown is both AI- and human-friendly, and the result is surprisingly maintainable.

GaggiX 3 days ago

>are cheap and strong enough to make this practical.

It all depends on the scale you need them at; with the API it's easy to generate millions of tokens without thinking about it.

agentcoops 3 days ago

You don't need full reasoning to get accurate results, so even with GPT-5 it's still pretty cheap for a one-time job, and the costs are easy to reason about. It's certainly cheaper than classical OCR when reliability is key, since classical OCR will undoubtedly require some manual data cleaning...
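A hedged sketch of turning reasoning down for a cheap one-shot extraction pass, assuming the OpenAI Python SDK exposes a reasoning_effort parameter for GPT-5-family models (check the current API docs):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="minimal",  # assumption: keep reasoning (and cost) to a minimum
    messages=[{"role": "user", "content": "Convert this page to Markdown: ..."}],
)
print(resp.choices[0].message.content)
```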

I can recommend the Mistral OCR API [1] if you have large jobs and don't want to think about it too much.

[1] https://mistral.ai/solutions/document-ai
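A minimal sketch of calling it, assuming the mistralai Python client and the mistral-ocr-latest model name (both taken from Mistral's docs; verify against the link above):

```python
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/report.pdf",  # any publicly reachable PDF
    },
)

# Each page comes back with its content already rendered as Markdown.
markdown = "\n\n".join(page.markdown for page in ocr_response.pages)
print(markdown)
```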

rdos 3 days ago

In that case you should run a model locally; this one, for example: https://huggingface.co/ds4sd/docling-models
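A minimal sketch using the docling Python package, which (as I understand it) pulls the ds4sd/docling-models weights linked above on first run and converts entirely locally:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("input.pdf")   # local conversion, no API calls
print(result.document.export_to_markdown())
```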