I've been playing with whisper to try to do local transcription of long videos, but one issue I've found is that long (>15 seconds) spans without any speech tend to send it into hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they will be able to configure it in a way that can improve that situation.

franga2000 a day ago | parent | next [-]

Whisper is supposed to be used with voice activity detection (VAD), and all production implementations that I've seen do that. The raw model is known to make up nonsense for silence because, as I understand it, it was never trained not to do that, on the assumption that everyone would use VAD.

42lux a day ago | parent | prev [-]

You usually delete silence before using something like whisper.

re a day ago | parent | next [-]

I've heard that, but that doesn't sound like a useful approach for videos where (1) non-speech segments can have plenty of other sound (music, noise) and (2) you want timestamps to match up with the original video, like for subtitles. But maybe there are known mitigations for both of those issues that I'm not aware of. And if they do exist maybe they can be included in the ffmpeg whisper integration.

miki123211 a day ago | parent [-]

By "delete", people mostly mean "detect", so that you can avoid processing such segments through Whisper. There's no reason to actually cut the silence out from the original audio file.
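
To illustrate the "detect, don't cut" idea, here is a rough sketch. It assumes a plain RMS energy threshold as a stand-in for a real VAD model, and `speech_segments`, `threshold`, and `min_silence_s` are made-up names for illustration, not anything from Whisper or ffmpeg. It returns speech spans in the original file's timestamps, so you would only feed those spans to Whisper and offset its output by each span's start.

```python
import math

def speech_segments(samples, sample_rate, frame_ms=30,
                    threshold=0.02, min_silence_s=0.5):
    """Return (start_s, end_s) spans whose frame RMS is >= threshold.

    Times are measured against the original audio; nothing is cut out.
    All parameter names and defaults are illustrative assumptions.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    start = None            # start time of the span currently open
    last_voiced_end = None  # end time of the last loud frame seen
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        t0 = i / sample_rate
        t1 = (i + len(frame)) / sample_rate
        if rms >= threshold:
            if start is None:
                start = t0
            last_voiced_end = t1
        elif start is not None and t0 - last_voiced_end >= min_silence_s:
            # Quiet for long enough: close the open span.
            segments.append((start, last_voiced_end))
            start = None
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments

# 1 s of "sound", 2 s of silence, 1 s of "sound" at a toy 1 kHz sample rate.
audio = [0.5] * 1000 + [0.0] * 2000 + [0.5] * 1000
print(speech_segments(audio, sample_rate=1000, frame_ms=100))
# -> [(0.0, 1.0), (3.0, 4.0)]
```

An energy threshold like this won't tell speech apart from music or noise, so it doesn't address point (1) above; that needs an actual VAD. The timestamp handling would stay the same either way.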

hnlmorg a day ago | parent | prev [-]

This is designed for real-time use too. And in such cases, you couldn’t delete the silence before use.

42lux a day ago | parent [-]

The ffmpeg implementation might be; the example was not.