weitendorf 10 hours ago:
Hey, I was literally just working on this today (I was racing ahead on an audio fine-tune myself, but OP beat me by a few hours). For audio inference, definitely try running your input through VAD (voice activity detection) first to drop junk data, and if necessary as one of several preprocessing steps before sending the audio to the large model. You can check out how I did it here: https://github.com/accretional/vad/blob/main/pkg/vad/vad.go

I was using https://huggingface.co/onnx-community/pyannote-segmentation-... because with ONNX I could run it on Intel servers with vectorized instructions, locally on my Mac, AND in-browser with transformers.js.

VAD is absurdly time-effective (on the order of 10 seconds to segment an hour of audio) and reduces the false-positive rate and cost of transcription and multimodal inference: you can pass small bits of segmented audio into another model specializing in that, then encode the result as text before passing it to the expensive model.
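To make the segment-before-transcribe idea concrete, here is a minimal sketch using a naive energy-threshold VAD in place of the pyannote ONNX model the comment actually uses. The function name, frame size, and threshold are assumptions chosen for illustration; a real pipeline would feed the returned spans to a transcription model.

```python
# Naive energy-based VAD sketch (illustration only; the comment above
# uses an ONNX pyannote segmentation model, not this heuristic).
import numpy as np

def vad_segments(audio, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame RMS energy exceeds threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > threshold

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # segment opens on first voiced frame
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None                   # segment closes on first silent frame
    if start is not None:                  # audio ended mid-segment
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                 # 1 second of samples
    tone = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for speech
    audio = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
    # Only the middle second survives; silence on both sides is dropped,
    # so only ~1/3 of this clip would be sent on to the expensive model.
    print(vad_segments(audio, sr))
```

Only the voiced spans get passed downstream, which is where the cost reduction comes from: the expensive transcription/multimodal model never sees the junk audio at all.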
MediaSquirrel 10 hours ago (in reply):
Great minds think alike! Also, I had a huge head start: I spent a month or two working on this in September 2025, shelved it, and dusted it back off this weekend.