sjsdaiuasgdia 6 days ago

LLM-based OCR and speech transcription do come with a failure condition that is different from what you see in pre-LLM solutions. When the source data is hard to understand, LLMs try to fill the gap with something that makes sense given the surrounding context.

Pre-LLM approaches handle unintelligible source data differently. You'll more commonly see nonsense output for the unintelligible bits. In some cases the tool might be capable of recognizing low confidence and returning an error or other indicator of a possible miss.

IMO, that pre-LLM behavior is a feature: nonsense or an explicit low-confidence flag tells you where to look. The LLM approach makes up something that looks right but may not actually match the source data. These errors are far harder to detect and more likely to make it past human review.
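To make the "low confidence indicator" idea concrete, here's a rough sketch in Python. It assumes a recognizer that reports a per-token confidence score; the `Token` type, the `render_with_flags` helper, the 0.6 threshold, and the sample data are all made up for illustration and don't correspond to any particular OCR or speech tool.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical recognizer


def render_with_flags(tokens, threshold=0.6, placeholder="[unclear]"):
    """Replace low-confidence tokens with a visible marker instead of guessing.

    Returns the rendered text and the number of flagged tokens.
    The threshold is an arbitrary illustrative value.
    """
    out = []
    flagged = 0
    for tok in tokens:
        if tok.confidence < threshold:
            out.append(placeholder)
            flagged += 1
        else:
            out.append(tok.text)
    return " ".join(out), flagged


# Hypothetical recognizer output: one token was hard to read.
tokens = [
    Token("meet", 0.95),
    Token("at", 0.90),
    Token("tne", 0.30),
    Token("station", 0.88),
]
text, flagged = render_with_flags(tokens)
print(text)
print(flagged)
```

The point is that the gap stays visible to a reviewer, rather than being papered over with a plausible-looking guess.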

The LLM approach does mean that you can often get a more "complete" output from a low quality data source vs pre-LLM approaches. And sometimes it might even be correct! But it will get it wrong other times.

Another failure condition I've experienced with LLM-based voice transcription that I didn't have pre-LLM: running down the wrong fork in the road. Sometimes the LLM approaches will get a word or two wrong, such as words with similar phonetics or multiple meanings. The model may then continue down the path this mistaken context has created, outputting additional words that do not align with the source data at all.