cwnyth 7 hours ago

Not all machine learning is generative AI.

mc32 7 hours ago | parent [-]

True, but as with regular document scanning software, there can be errors in detection.

selcuka 32 minutes ago | parent | next [-]

Yeah. There was a weird Xerox WorkCentre bug that swapped digits (turning 6s into 8s) in scanned documents. It was caused by the JBIG2 compression the devices used [1].

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

dleeftink 7 hours ago | parent | prev [-]

Just as with redacted documents (consistently blocked terms) or bad OCR jobs (wrong or missing characters), even if only a certain percentage comes out unmangled, that is still more readable than having no data at all.

A stable base corpus and some dynamic programming will allow you to clean up the remainder[0].

[0]: http://stackoverflow.com/a/11642687/2449774
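
One way to read "a stable base corpus and some dynamic programming" is cost-based word segmentation: give each known word a cost from its frequency rank, then pick the lowest-cost split of the text. This is a rough sketch only. The tiny word list, the `log(rank + 2)` cost, and the `unknown_cost` penalty are all made-up assumptions for illustration, not something taken from a real corpus.

```python
from math import log

def build_costs(words_by_frequency):
    # Hypothetical cost model: words earlier in the list (more frequent)
    # get a lower cost. log(rank + 2) keeps every cost positive.
    return {w: log(rank + 2) for rank, w in enumerate(words_by_frequency)}

def segment(text, costs, unknown_cost=10.0):
    # Split `text` into tokens with the lowest total cost.
    # Pieces not in `costs` are allowed but penalised per character
    # (assumed penalty), so mangled spans survive as leftover tokens.
    max_len = max(map(len, costs))
    # best[i] = (lowest cost of segmenting text[:i], start of last token)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            cost = costs.get(piece, unknown_cost * len(piece))
            candidates.append((best[j][0] + cost, j))
        best.append(min(candidates))
    # Walk back through the stored start indices to recover the tokens.
    tokens = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

# Toy "base corpus", ordered from most to least frequent (invented).
corpus = ["the", "scanner", "changed", "numbers", "a", "can", "he"]
costs = build_costs(corpus)
print(segment("thescannerchangednumbers", costs))
# → ['the', 'scanner', 'changed', 'numbers']
```

Anything the word list doesn't cover comes back as a high-cost leftover token, so you at least know where the unrecovered parts are. It does nothing for text that was mangled into other valid words.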

mkl 5 hours ago | parent [-]

The problem is when you can't tell which bits are unmangled. OCR systems will happily give you plausible but wrong readings, and even some scanners/copiers will change things: https://dkriesel.com/en/blog/2013/0802_xerox-workcentres_are...