simonw 14 hours ago

I tried the demo and it looks like you have to click Mic, then record your audio, then click "Stop and transcribe" in order to see the result.

Is it possible to rig this up so it really is realtime, displaying the transcription within a second or two of the user saying something out loud?

The Hugging Face server-side demo at https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim... manages that, but it's using a much larger (~8.5GB) server-side model running on GPUs.

refulgentis 13 hours ago

It's not fast enough to be realtime, though with a more advanced UI and a ring buffer you could get the behavior you describe. (e.g. I do this with Whisper in Flutter, and also run inference on GGUFs with llama.cpp via Dart.)
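
For anyone unfamiliar with the ring buffer approach, here is a minimal Python sketch of the idea, not the code from the demo or from my Flutter app. It assumes 16 kHz mono samples arriving in small chunks, and `transcribe` is a hypothetical stand-in for whatever model call you use (Whisper, Voxtral, etc.). The names `AudioRingBuffer` and `stream_transcribe` are made up for illustration.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List


class AudioRingBuffer:
    """Fixed-capacity buffer that keeps only the most recent samples."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest items once capacity is reached.
        self._samples: deque = deque(maxlen=capacity)

    def write(self, chunk: List[float]) -> None:
        self._samples.extend(chunk)

    def snapshot(self) -> List[float]:
        # Copy, so the model sees a stable window while capture continues.
        return list(self._samples)


def stream_transcribe(
    chunks: Iterable[List[float]],
    transcribe: Callable[[List[float]], object],  # hypothetical model call
    sample_rate: int = 16000,       # assumed sample rate
    window_seconds: float = 10.0,   # assumed rolling window length
) -> Iterator[object]:
    """Re-transcribe the rolling window each time a new chunk arrives."""
    ring = AudioRingBuffer(int(sample_rate * window_seconds))
    for chunk in chunks:
        ring.write(chunk)
        yield transcribe(ring.snapshot())
```

The UI would then redraw the partial transcript on every yielded result. This only feels realtime if each `transcribe` call finishes before the next chunk arrives, which is the part this model doesn't manage.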

This isn't even close to realtime on an M4 Max. Whisper is roughly realtime on any post-2022 device with an ONNX implementation. On consumer hardware the extra inference cost isn't worth the WER decrease, or at least it wouldn't be worth the implementation time.
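
By "realtime" here I mean a real-time factor (processing time divided by audio duration) below 1. A rough way to measure it, sketched in Python under the assumption of 16 kHz audio, with `transcribe` again a hypothetical placeholder for the model call:

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    transcribe: Callable[[Sequence[float]], object],  # hypothetical model call
    samples: Sequence[float],
    sample_rate: int = 16000,  # assumed sample rate
) -> float:
    """Return processing time / audio duration; below 1.0 keeps up with speech."""
    if not samples:
        raise ValueError("need at least one sample to measure")
    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    transcribe(samples)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

A single run is noisy, so in practice you would average over several clips and discard the first (warm-up) call.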