| ▲ | nodja 6 hours ago | |
It's probably another ASR model that focuses on benchmarks and simple uses instead of more challenging real use cases. I upload edited gameplay VODs of Twitch streams to YouTube, and use whisper-large-v3 to provide subtitles for accessibility reasons (YouTube's own auto-subtitles suck, though they've been getting better).

My checklist for a good ASR model for my use case is:

1. Have timestamp support.
2. Support overlapping speakers.
3. Accurate transcripts that don't coalesce half words/interrupted sentences.
4. Support non-verbal stuff like [coughs], [groans], [laughs], [sighs], etc.
5. Allow context injection of non-trivial sizes (10k+ words).

1 is obvious because without it we can't have subtitles. Forced alignment fails too often.

2 is crucial for real-world scenarios because in the real world people talk over each other all the time. In my case it's a streamer talking over gameplay audio, or the streamer having guests over. When two people speak at once, the transcript either ignores one of them or, in the worst case, both of them.

3 and 4 are an accessibility thing: if you're deaf or hard of hearing, a more literal transcript of what's being said conveys better how the speaker is speaking. If all subtitles are properly "spell-checked" then it's clear your model is overfit to the benchmarks.

5 is not a requirement per se, more of a nice-to-have. In my use case the streamer is often reading stream chat, so feeding the model the list of users who recently talked, recent chat messages, text on screen, etc. would make for more accurate transcripts.

I've tried many models, and the ones that come closest to fulfilling my needs are LLM-style models on top of forced alignment. That's too slow, so I've been sticking with whisper because with whisperx I can get a transcript in 5 minutes with just a single command.

One thing all these models do (including whisper) is just omit full sentences, which is the worst thing a model can do. | ||
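To illustrate why point 1 on the checklist matters, here is a minimal sketch of turning timestamped segments into SRT subtitles. It assumes the ASR output is a list of dicts with `start`, `end` (in seconds) and `text` keys, which is an assumption about the data shape rather than a guaranteed format for any particular tool. The function names are made up for the example.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from segments shaped like
    {"start": float, "end": float, "text": str} (assumed shape)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Non-verbal tags such as [laughs] pass through unchanged.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Blank line between consecutive subtitle blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 1.25, "text": " [laughs] wait, what"},
        {"start": 1.25, "end": 3.0, "text": "okay chat, hold on"},
    ]
    print(segments_to_srt(demo))
```

Without per-segment timestamps there is nothing to put in the `-->` line, which is why a transcript-only model can't produce subtitles without a separate alignment step.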
| ▲ | Nimitz14 2 hours ago | parent | next [-] | |
3 and 4 are actually negative value for most customers | ||
| ▲ | satozawa 3 hours ago | parent | prev [-] | |
[dead] | ||