ph4evers a day ago

Whisper works on 30 second chunks. So yes it can do that and that’s also why it can hallucinate quite a bit.

jeroenhd a day ago | parent | next [-]

The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

    queue
    
         The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
londons_explore a day ago | parent [-]

so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
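
A minimal sketch of that commit-on-agreement idea (the function and toy data are my own illustration, not any particular service's API):

    # Keep N running hypotheses; only "fix" a word once every hypothesis agrees on it.
    from typing import List

    def fix_common_prefix(hypotheses: List[List[str]], already_fixed: int) -> int:
        """Return the new count of fixed words: the longest prefix shared by all hypotheses."""
        fixed = already_fixed
        shortest = min(len(h) for h in hypotheses)
        while fixed < shortest and all(h[fixed] == hypotheses[0][fixed] for h in hypotheses):
            fixed += 1
        return fixed

    # Toy example: once later audio makes every hypothesis agree on "ice cream is",
    # those words become fixed even though they straddled the earlier ambiguity.
    hyps_early = [["I", "scream"], ["ice", "cream"], ["I", "scream"]]
    hyps_later = [["ice", "cream", "is", "the"], ["ice", "cream", "is", "a"], ["ice", "cream", "is", "the"]]
    fixed = fix_common_prefix(hyps_early, 0)      # 0: hypotheses still disagree
    fixed = fix_common_prefix(hyps_later, fixed)  # 3: "ice cream is" is now fixed
    print(hyps_later[0][:fixed])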

miki123211 a day ago | parent | next [-]

The right way to do this would be to use longer, overlapping chunks.

E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording), as in the sketch below.

This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
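
A rough sketch of that sliding window using the openai-whisper Python package (get_latest_audio is a hypothetical stand-in for your microphone capture at 16 kHz float32; caching and de-duplicating the overlapping text is left out):

    import time
    import numpy as np
    import whisper

    HOP_S = 3        # how often we re-run transcription
    WINDOW_S = 15    # how much trailing audio each run sees
    SAMPLE_RATE = 16000

    model = whisper.load_model("tiny.en")
    buffer = np.zeros(0, dtype=np.float32)

    while True:
        new_samples = get_latest_audio()           # hypothetical capture function
        buffer = np.concatenate([buffer, new_samples])
        buffer = buffer[-WINDOW_S * SAMPLE_RATE:]  # keep only the trailing window
        result = model.transcribe(buffer, language="en")
        print(result["text"])                      # naive: re-prints the whole window each hop
        time.sleep(HOP_S)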

superluserdo a day ago | parent | next [-]

I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription.

https://tomwh.uk/git/whisper-chunk.git/

I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.

dylan604 19 hours ago | parent | prev [-]

If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get real-time being the ultimate goal, but we're not there yet. So, working within the current limitations, is piss-poor transcription in real time really more desirable than better transcription with a 2-3 second delay?

jeroenhd 16 hours ago | parent | prev | next [-]

I don't know of an LLM that does context-based rewriting of interpreted text.

That said, I haven't run into the ice cream problem with Whisper. Plenty of other systems fail, but Whisper just seems to get lucky and guess the right words more than anything else.

The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.

llarsson a day ago | parent | prev | next [-]

Attention is all you need, as the transformative paper (pun definitely intended) put it.

Unfortunately, you're only getting attention in 3 second chunks.

abdullahkhalids 17 hours ago | parent | prev | next [-]

Which other streaming transcription services are you referring to?

londons_explore 13 hours ago | parent [-]

Google's Speech-to-Text API: https://cloud.google.com/speech-to-text/docs/speech-to-text-...

The "alternatives" and "confidence" fields are the result of the N-best decoding described elsewhere in the thread.

no_wizard 21 hours ago | parent | prev [-]

That’s because, at the end of the day, this technology doesn’t “think”. It simply holds context until the next thing, without regard for the previous information.

anonymousiam a day ago | parent | prev | next [-]

Whisper is excellent, but not perfect.

I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

JohnKemeny 21 hours ago | parent | next [-]

Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.
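
With the openai-whisper Python package, that context goes in via the initial_prompt argument (a sketch; the file name and prompt wording are just examples):

    import whisper

    # initial_prompt biases the decoder toward particular spellings,
    # e.g. "Gem" rather than "Jim"/"Jem".
    model = whisper.load_model("base.en")
    result = model.transcribe(
        "phone_call.wav",  # hypothetical recording
        initial_prompt="A phone call with Gem about an invoice.",
    )
    print(result["text"])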

ctxc 20 hours ago | parent [-]

Thanks John Key Many!

t-3 19 hours ago | parent | prev [-]

That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.

anonymousiam 16 hours ago | parent [-]

When she told me her name, I didn't ask her to repeat it, and I got it right through the rest of the call. Whisper didn't, so how is this "at least as good as a human"?

t-3 16 hours ago | parent [-]

I wouldn't expect any transcriber to know that the correct spelling in your case used a G rather than a J - the J is far more common in my experience. "Jim" would be an aberration that could be improved, but substituting "Jem" for "Gem" without any context to suggest the latter would be just fine IMO.

0points a day ago | parent | prev [-]

So, yes, and also no.