sorenjan 3 months ago

Using an LLM to correct text is a good idea, but the text transcript doesn't carry any information about how confident the speech-to-text conversion was. Whisper can output a confidence for each word, which would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech-to-text model is too computationally expensive for YouTube at the moment.
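
A minimal sketch of what such a pipeline could look like, assuming the speech-to-text step yields a list of words with a per-word probability (the open-source Whisper package reports something along these lines when word timestamps are enabled). The function names, the 0.5 threshold, the bracket marking, and the sample data are all hypothetical, chosen only for illustration:

```python
from typing import Dict, List


def mark_uncertain(words: List[Dict], threshold: float = 0.5) -> str:
    """Join words into text, wrapping low-confidence words as [word?]."""
    parts = []
    for w in words:
        token = w["word"].strip()
        # Flag only the words the recognizer itself was unsure about.
        if w["probability"] < threshold:
            token = f"[{token}?]"
        parts.append(token)
    return " ".join(parts)


def build_prompt(marked: str) -> str:
    """Build an LLM prompt that restricts edits to the flagged words."""
    return (
        "Words in [brackets?] were transcribed with low confidence. "
        "Correct only those words if context makes the fix clear, "
        "and leave every other word unchanged.\n\n" + marked
    )


# Hypothetical recognizer output, not real data.
sample = [
    {"word": " the", "probability": 0.98},
    {"word": " whether", "probability": 0.31},
    {"word": " forecast", "probability": 0.95},
]
marked = mark_uncertain(sample)
prompt = build_prompt(marked)
```

Restricting the LLM to the flagged words keeps the high-confidence parts of the transcript verbatim, though nothing here enforces that the model actually obeys the instruction.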

dylan604 3 months ago | parent [-]

Depends on what the transcript is for. If you expect the exact words that were spoken, in written form, then any deviation from that is no longer a transcription. At that point it is text loosely based on the spoken content.

Once you accept that it's okay for the LLM to just replace words in a transcript, you might as well let it make up a story based on character names you've provided.

falcor84 3 months ago | parent [-]

> any deviation from that is no longer a transcription

That's a wild exaggeration. Professional transcripts often have small (and not so small) mistakes, caused by typos, mishearing or lack of familiarity with the subject matter. Depending on the case, these are then manually proofread, but even after proofreading, some mistakes often remain, and occasionally new ones are even introduced.

dylan604 3 months ago | parent [-]

Maybe, but typos are not the same thing as an LLM choosing what it thinks is a better next word instead of just transcribing what was heard.