Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them, comparing with the OCR DOJ provided, and a lot it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracing the pages from the PDFs, I'm currently sitting on ~500K images to OCR, so this is currently taking quite a while to run through.

▲

originalvichy 4 hours ago | parent | next [-]

Did you take any steps to decrease the dimension size of images, if this increases the performance? I have not tried this as I have not peformed an OCR task like this with an LLM. I would be interested to know at what size the vlm cannot make out the details in text reliably.

	▲	embedding-shape 4 hours ago \| parent [-]
		The performance is OK, takes a couple of seconds at most on my GPU, just the amount of documents to get through that takes time, even with parallelism. The dimension seems fine as it is, as far as I can tell.

▲

helterskelter 4 hours ago | parent | prev [-]

[flagged]

▲

embedding-shape 4 hours ago | parent [-]

Haven't seen anything particular about that, but lots of the documents with names that were half-redacted contain OCRd text that is completely garbled, but olmocr-2-7b seems to handle it just fine. Unsure if they just had sucky processes or if there is something else going on.

▲

helterskelter 4 hours ago | parent [-]

Might be a good fit for uploading a git repo and crowdsourcing

	▲	embedding-shape 3 hours ago \| parent [-]
		Was my first impulse too but not sure I trust that unless I could gather a bunch of people I trust, which would mean I'd no longer be anonymous. Kind of a catch22.