Remix clone Hacker News


	▲	nvdnadj92 4 days ago
		I'm working on the same project myself and was planning to write a blog post similar to the author's. In the meantime, here are some additional tips and tricks that made a real difference for me.
		Preprocessing: I found it best to convert files to 16 kHz WAV. I also apply low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD over the entire audio file to find the timestamps where someone is speaking. A side note: Silero requires careful tuning to prevent audio segments from being chopped up and clipped. I also use a post-processing step that merges adjacent VAD chunks, which helps keep the segments passed to Whisper cohesive.
		Whisper: I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and using a model designed for the Apple Neural Engine made a 22x difference.
		Post-processing: I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks gives better results.
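A minimal Python sketch of two of the steps described above: building an ffmpeg command for the 16 kHz WAV conversion with high-pass and low-pass filters, and merging adjacent VAD chunks. This is not the commenter's code. The function names, the 200 Hz and 3000 Hz cutoffs, the mono downmix, and the `max_gap` and `pad` defaults are all assumptions chosen for illustration, and the VAD timestamps are assumed to already be available as `(start, end)` pairs in seconds.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def ffmpeg_preprocess_cmd(src: str, dst: str,
                          highpass_hz: int = 200,
                          lowpass_hz: int = 3000) -> List[str]:
    """Build an ffmpeg argument list that writes a 16 kHz WAV.

    The cutoff frequencies and the mono downmix (-ac 1) are assumptions,
    not values taken from the thread.
    """
    filters = f"highpass=f={highpass_hz},lowpass=f={lowpass_hz}"
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1", "-ar", "16000",
            "-af", filters, dst]


def merge_adjacent(segments: Sequence[Segment],
                   max_gap: float = 0.5,
                   pad: float = 0.0) -> List[Segment]:
    """Merge speech segments separated by at most max_gap seconds.

    Each segment is first widened by pad seconds on both sides (clamped
    at 0), which is one way to reduce clipped word edges. The gap is
    measured between the padded segments.
    """
    merged: List[Segment] = []
    for start, end in sorted(segments):
        start = max(0.0, start - pad)
        end = end + pad
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The command list can be passed to `subprocess.run(cmd, check=True)`, and each merged segment can then be cut out and transcribed on its own. Suitable `max_gap` and `pad` values depend on the recording and need tuning.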
	▲	adzm 4 days ago
		I added EQ to a task after reading this and got much more accurate and consistent results from Whisper. Thanks for the obvious-in-retrospect tip.
	▲	bnmoch3 3 days ago
		Could you please share the prompt you use in ChatGPT to remove hallucinated chunks?
	▲	eevmanu 3 days ago
		If I understood correctly, VAD gives better results than ffmpeg's silencedetect + silenceremove, right? I think the latest version of ffmpeg can run Whisper with VAD [1], but I still need to explore how with a simple PoC script.
		I'd love to know more about the post-processing prompt. My guess is that it looks like an improved version of a `semantic correction` prompt [2], but I may be wrong ¯\_(ツ)_/¯
		[1] https://ffmpeg.org/ffmpeg-filters.html#toc-whisper-1
		[2] https://gist.github.com/eevmanu/0de2d449144e9cd40a563170b459...
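For comparison with the VAD approach, here is a minimal Python sketch that turns ffmpeg silencedetect log output into speech segments by inverting the reported silences. It does not answer the question of which approach is better. silencedetect thresholds on signal level, while a VAD model classifies speech, so loud non-speech audio still counts as non-silent here. The function name is hypothetical, the total `duration` is assumed to be known from elsewhere, and the log lines are assumed to contain `silence_start: <sec>` and `silence_end: <sec>` fields.

```python
import re
from typing import List, Tuple

# Assumed log fields: "silence_start: 4.25" and "silence_end: 6 | silence_duration: 1.75"
_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def speech_segments(log: str, duration: float) -> List[Tuple[float, float]]:
    """Return the (start, end) spans in seconds that lie between silences."""
    segments: List[Tuple[float, float]] = []
    cursor = 0.0        # where the current non-silent span began
    in_silence = False
    for line in log.splitlines():
        match = _START.search(line)
        if match:
            # Clamp, since a reported start can be slightly negative.
            start = max(0.0, float(match.group(1)))
            if start > cursor:
                segments.append((cursor, start))
            in_silence = True
            continue
        match = _END.search(line)
        if match:
            cursor = float(match.group(1))
            in_silence = False
    # Anything after the last silence counts as non-silent audio.
    if not in_silence and cursor < duration:
        segments.append((cursor, duration))
    return segments
```

The log text would come from the stderr of an ffmpeg run that uses the silencedetect filter. The resulting segments can be fed to the same chunked transcription step as VAD timestamps.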