Dzugaru 8 hours ago

Outstanding work! I've participated in the challenge, but didn't get far. One of the questions I had at the time was - if I'm going to use ML to detect ink, could it invent hallucinated letters, or even parts of text, and how to prevent that?

verditelabs 8 hours ago | parent | next [-]

Yes, it's quite possible for ML to hallucinate ink, though it happens on a much more local scale: predicting a slightly longer stroke, filling in more of a character than is actually in the data, etc. That is perhaps enough to change the reading of a character or to show ink where there isn't any. It is difficult for ink detection to hallucinate grammatical and idiomatic Greek and Latin.

im3w1l 8 hours ago | parent [-]

What is the input to the ML algorithm? Does it know the surrounding context, so that it has a chance to deduce "if this stroke is slightly longer then the end result will be idiomatic Greek and Latin"?

verditelabs 8 hours ago | parent [-]

The input is 3D chunks of reconstructed CT data from our scans. I can't remember the specifics, but maybe enough voxels for 0.5 mm^3 at a time or so? They're all available for free from https://registry.opendata.aws/vesuvius-challenge-herculaneum... . Our trained models are all available at https://huggingface.co/scrollprize
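
To make "3D chunks" concrete, here is a minimal sketch of cutting a CT volume into small cubic sub-volumes that a model would see one at a time. The function name `iter_chunks`, the chunk size, the stride, and the random stand-in volume are all assumptions for illustration, not the actual pipeline or its parameters.

```python
import numpy as np


def iter_chunks(volume, size, stride):
    """Yield ((z, y, x), chunk) for every cubic chunk of side `size`.

    Hypothetical helper: the real chunk dimensions and stride are not
    given in the thread.
    """
    depth, height, width = volume.shape
    for z in range(0, depth - size + 1, stride):
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (z, y, x), volume[z:z + size, y:y + size, x:x + size]


# Random data standing in for a reconstructed CT volume (assumption).
volume = np.random.default_rng(0).random((32, 64, 64), dtype=np.float32)

# Each chunk covers only a small neighbourhood of voxels; in this sketch
# nothing about the surrounding letters or words is passed along with it.
chunks = list(iter_chunks(volume, size=32, stride=16))
print(len(chunks), chunks[0][1].shape)
```

With chunks this small, any error stays local to a few strokes, which fits the point above that hallucinating whole idiomatic phrases is unlikely.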

cwnyth 8 hours ago | parent | prev | next [-]

Not all machine learning is generative AI.

mc32 8 hours ago | parent [-]

True, but as with regular document scanning software, there can be errors in detection.

dleeftink 8 hours ago | parent | next [-]

Just as with redacted documents (consistently blocked terms) or bad OCR jobs (wrong or missing characters), even if only a certain percentage comes out unmangled it is more readable than having no data at all.

A stable base corpus and some dynamic programming will allow you to clean up the remainder[0].

[0]: http://stackoverflow.com/a/11642687/2449774
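
A rough sketch of the kind of dynamic-programming cleanup meant here: given a known vocabulary with per-word costs, find the cheapest way to split a run of characters into words. The vocabulary, the rank-based costs, and the `UNKNOWN_COST` penalty are made up for illustration; a real corpus would supply frequency-derived costs.

```python
import math

# Toy vocabulary ordered from most to least frequent (assumption).
VOCAB = ["the", "is", "in", "on", "ink", "scroll"]
# Rarer words cost more; log(rank + 2) keeps every cost positive.
WORD_COST = {word: math.log(rank + 2) for rank, word in enumerate(VOCAB)}
# Per-character penalty for text not in the vocabulary (assumption).
UNKNOWN_COST = 20.0


def segment(text, word_cost=WORD_COST):
    """Split `text` into the lowest-cost sequence of pieces."""
    max_len = max(len(word) for word in word_cost)
    # best[i] holds (cost, start) for the cheapest split of text[:i],
    # where `start` is the index at which the last piece begins.
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            cost = word_cost.get(piece, UNKNOWN_COST * len(piece))
            candidates.append((best[j][0] + cost, j))
        best.append(min(candidates))
    # Walk back from the end to recover the chosen pieces.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))


print(segment("theinkisonthescroll"))
```

This only recovers word boundaries and known words; it cannot tell you whether an individual character was read correctly in the first place.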

mkl 6 hours ago | parent [-]

The problem is when you can't tell which bits are unmangled. OCR systems will happily give you plausible but wrong readings, and even some scanners/copiers will change things: https://dkriesel.com/en/blog/2013/0802_xerox-workcentres_are...

selcuka 2 hours ago | parent | prev [-]

Yeah. There was a weird bug in Xerox scanners/copiers that swapped digits (turning 6s into 8s) in scanned documents, caused by JBIG2 compression [1].

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
