sorenjan 4 hours ago
Using an LLM to correct text is a good idea, but the text transcript doesn't carry any information about how confident the speech-to-text conversion was. Whisper can output a confidence for each word, which would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech-to-text model is too computationally expensive for YouTube at the moment.
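
A rough sketch of what that pipeline step could look like, assuming word entries shaped like the ones the open-source `whisper` package returns when transcribing with `word_timestamps=True` (dicts with `"word"` and `"probability"` keys). The function name, the threshold, and the `[?...?]` marker format are all made up for illustration; the idea is only that the LLM gets told which words the recognizer was unsure about.

```python
from typing import Dict, List


def mark_uncertain_words(words: List[Dict], threshold: float = 0.6) -> str:
    """Join words into text, wrapping low-confidence ones in [?...?] markers.

    `words` is assumed to be a list of dicts with "word" and "probability"
    keys. The threshold of 0.6 is an arbitrary illustrative value.
    """
    parts = []
    for w in words:
        token = w["word"].strip()  # word strings may carry a leading space
        if w["probability"] < threshold:
            parts.append(f"[?{token}?]")  # flag for the LLM to reconsider
        else:
            parts.append(token)  # confident word, passed through unchanged
    return " ".join(parts)


# Hand-written sample data standing in for real recognizer output.
sample_words = [
    {"word": " The", "probability": 0.98},
    {"word": " whether", "probability": 0.41},
    {"word": " is", "probability": 0.95},
    {"word": " nice", "probability": 0.91},
]

marked = mark_uncertain_words(sample_words)
print(marked)
```

The marked text could then go into the correction prompt with an instruction to reconsider only the flagged words, which keeps the confident part of the transcript untouched.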
dylan604 4 hours ago
It depends on the purpose of the transcript. If you expect the exact words that were spoken, in written form, then any deviation from that is no longer a transcription. At that point it is text loosely based on the spoken content. Once you accept that it is okay for the LLM to just replace words in a transcript, you might as well let it make up a story based on character names you've provided.