huijzer 6 days ago
Last time I checked, a few months ago, LLMs were more accurate than the OCR that the archive is using. The web archive version is/was not using context to figure out that, for example, “in the garden was a trge” should be “in the garden was a tree”. LLMs, depending on the prompt, do this.
quuxplusone 5 days ago
Perhaps. My perhaps-curmudgeonly take on that is that it sounds a bit like "Xerox scanners/photocopiers randomly alter numbers in scanned documents" ( https://news.ycombinator.com/item?id=29223815 ). I'd much rather deal with "In the garden was a trge" than "In the garden was a tree," for example, if what the page actually said was "In the garden was a tiger."

That said, of course you're right that context is useful for OCRing. See for example https://history.stackexchange.com/questions/50249/why-does-n...

Another, perhaps-leftpaddish argument is that by outsourcing the job to archive.org I'm allowing them to worry about the "best" way to OCR things, rather than spending my own time figuring it out. Wikisource, for example, seems to have gotten markedly better at OCRing pages over the past few years, and I assume that's because they're swapping out components behind the scenes.
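
To make the "trge" vs. "tree" vs. "tiger" tradeoff concrete, here is a minimal sketch of the conservative option: flag a suspicious OCR token and list plausible corrections, rather than silently replacing it. Everything here is an assumption for illustration. The toy vocabulary `VOCAB` and the helper `flag_suspect_words` are hypothetical, and this is not how archive.org, Wikisource, or any LLM actually post-processes OCR output. It only uses `difflib` from the Python standard library.

```python
import difflib

# Hypothetical toy vocabulary; a real pipeline would use a full dictionary.
VOCAB = ["in", "the", "garden", "was", "a", "tree", "tiger", "true"]


def flag_suspect_words(text, vocab=VOCAB, cutoff=0.6):
    """Return (token, candidates) pairs for tokens not found in vocab.

    The input text is never modified: the caller sees the raw OCR token
    alongside the possible corrections and decides what to do with it.
    """
    flagged = []
    for word in text.lower().split():
        token = word.strip(".,;:!?\"'")
        if token and token not in vocab:
            # Similarity-ranked suggestions from the vocabulary.
            candidates = difflib.get_close_matches(
                token, vocab, n=5, cutoff=cutoff
            )
            flagged.append((token, candidates))
    return flagged


if __name__ == "__main__":
    for token, candidates in flag_suspect_words("In the garden was a trge"):
        print(f"suspect token {token!r}; possible readings: {candidates}")
```

With this toy vocabulary, both "tree" and "tiger" come back as candidates for "trge", which is the point of the objection: string similarity alone cannot tell which one the page said, so an automatic correction may be a confident guess rather than a reading.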