sorenjan 7 months ago

Using an LLM to correct text is a good idea, but the text transcript doesn't carry any information about how confident the speech-to-text conversion was. Whisper can output a confidence for each word, which would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech-to-text model is too computationally expensive for YouTube at the moment.
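
A rough sketch of what I mean, assuming the word list has the shape the open-source Whisper package returns with `word_timestamps=True` (each word a dict with `word` and `probability` keys). The function names, the `[?...?]` marker, the 0.6 threshold, and the sample data are all made up for illustration, not taken from any existing pipeline:

```python
from typing import Dict, List, Union

# Assumed shape of one recognized word: {"word": str, "probability": float}
Word = Dict[str, Union[str, float]]


def mark_uncertain(words: List[Word], threshold: float = 0.6) -> str:
    """Join words into a transcript, wrapping low-confidence ones in [?...?]."""
    out = []
    for w in words:
        text = str(w["word"]).strip()
        # Flag words the recognizer was unsure about (threshold is arbitrary).
        if float(w["probability"]) < threshold:
            text = f"[?{text}?]"
        out.append(text)
    return " ".join(out)


def build_prompt(words: List[Word], threshold: float = 0.6) -> str:
    """Build a correction prompt that tells the LLM which words are suspect."""
    return (
        "Correct this transcript. Words in [?...?] had low recognizer "
        "confidence; prefer changing those and leave other words alone.\n\n"
        + mark_uncertain(words, threshold)
    )


# Hypothetical sample data, not real recognizer output.
sample = [
    {"word": " The", "probability": 0.98},
    {"word": " whether", "probability": 0.41},
    {"word": " is", "probability": 0.95},
    {"word": " nice", "probability": 0.93},
]

if __name__ == "__main__":
    print(build_prompt(sample))
```

The point is just that the LLM gets told where the recognizer was unsure, so it can focus its corrections there instead of guessing across the whole transcript.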