miki123211 | a day ago
The right way to do this would be to use longer, overlapping chunks. E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording). This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
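A minimal sketch of the windowing arithmetic described above, assuming a 3-second re-transcription step and a 15-second lookback; `window_bounds` is a hypothetical helper, not part of any real Whisper API, and the actual transcription call is left out:

```python
STEP_S = 3     # re-transcribe every 3 seconds
WINDOW_S = 15  # always feed the model the most recent 15 s

def window_bounds(elapsed_s: int) -> tuple[int, int]:
    """Return (start, end) in seconds of the audio span to transcribe
    after `elapsed_s` seconds of recording.

    The window ends at the last completed step boundary and reaches
    back at most WINDOW_S seconds; near the start of the recording
    the window is simply shorter."""
    end = (elapsed_s // STEP_S) * STEP_S
    start = max(0, end - WINDOW_S)
    return start, end
```

Each new window overlaps the previous one by 12 s, so earlier words get re-transcribed with more right-context; that re-processing of the same audio is the cost the comment mentions.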
superluserdo | a day ago
I basically implemented exactly this on top of whisper, since I couldn't find any implementation that allowed for live transcription: https://tomwh.uk/git/whisper-chunk.git/ I need to get around to cleaning it up, but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model, with multiple simultaneous listeners, is good enough for highly accurate live English transcription with 2-3 s latency on a mid-range modern consumer CPU.
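A sketch of how chunk length and overlap fraction determine the chunk schedule in a setup like the one described above; the repo's actual interface may differ, and `chunk_starts` is a hypothetical illustration:

```python
def chunk_starts(total_s: float, chunk_len_s: float,
                 overlap_frac: float) -> list[float]:
    """Start times (in seconds) of overlapping chunks covering
    `total_s` seconds of audio.

    The stride between chunk starts is the non-overlapping portion
    of each chunk: chunk_len_s * (1 - overlap_frac). A higher
    overlap fraction means more simultaneous listeners covering
    the same instant of audio."""
    stride = chunk_len_s * (1 - overlap_frac)
    starts = []
    t = 0.0
    while t < total_s:
        starts.append(t)
        t += stride
    return starts
```

With a 10 s chunk and 0.5 overlap, chunks start every 5 s, so any moment of audio is heard by two listeners; disagreements between overlapping transcripts can then be reconciled before display.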
dylan604 | 19 hours ago
If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get that real-time is the ultimate goal, but we're not there yet. So, working within the current limitations: is piss-poor transcription in real time really more desirable than better transcription with a 2-3 second delay?