Ask HN: What's the current best local/open speech-to-speech setup?
69 points by dsrtslnd23 16 hours ago | 15 comments

I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).

Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.

What are people actually using in 2026 if they want open + local voice?

Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?

If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?

What’s the most “works today” combo on a single GPU?

Bonus: rough numbers people see for mic → first audio back

Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
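For reference, the naive glued-together loop I have in mind looks roughly like this. It's only a sketch, with no VAD, streaming, or barge-in, and faster-whisper / Ollama / Kokoro are just placeholder picks for local ASR / LLM / TTS:

    # Turn-based glue: record -> local ASR -> local LLM -> local TTS -> speaker.
    import numpy as np
    import requests
    import sounddevice as sd
    from faster_whisper import WhisperModel
    from kokoro import KPipeline

    SR = 16000
    asr = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")  # assumes a CUDA GPU
    tts = KPipeline(lang_code="a")  # Kokoro English pipeline

    while True:
        # Record a fixed 5 s utterance (a real setup would use VAD / endpointing).
        audio = sd.rec(int(5 * SR), samplerate=SR, channels=1, dtype="float32")
        sd.wait()

        # Transcribe locally.
        segments, _ = asr.transcribe(audio[:, 0], beam_size=1)
        text = " ".join(s.text for s in segments).strip()
        if not text:
            continue

        # One-shot reply from a local Ollama server.
        reply = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b", "prompt": text, "stream": False},
            timeout=120,
        ).json()["response"]

        # Synthesize and play (Kokoro outputs 24 kHz audio).
        for _, _, wav in tts(reply, voice="af_heart"):
            sd.play(np.asarray(wav), samplerate=24000)
            sd.wait()

The question is basically whether anything open and local beats this kind of glue on latency and interruption handling.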

mpaepper 6 hours ago | parent | next [-]

You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/

It has dual-channel input/output and a very permissive license.

cbrews 4 hours ago | parent | next [-]

Thanks for sharing this! I'm going to put this on my list to play around with. I'm not really an expert in this tech (I come from an audio background), but I was recently playing around with streaming speech-to-text (using Whisper) and text-to-speech (using Kokoro at the time) on a local machine.

The most challenging part of my build was tuning the inference batch sizing. I was able to get speech-to-text working well down to batch sizes of 200ms. I even implemented a basic local agreement algorithm and it was still very fast (inference time, I think, was around 10-20ms?). You're basically limited by the minimum batch size, NOT inference time. Maybe that's the missing "secret sauce" the original post is looking for?
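From memory, the agreement logic was something in this spirit (minimal sketch only; faster-whisper and sounddevice here are stand-ins for whatever you actually run, and the 200 ms chunk size is the batch size mentioned above):

    # Chunked streaming STT with a simple LocalAgreement-style commit rule:
    # re-transcribe the growing buffer every ~200 ms and only emit the prefix
    # that two consecutive hypotheses agree on.
    import numpy as np
    import sounddevice as sd
    from faster_whisper import WhisperModel

    SR, CHUNK_SEC = 16000, 0.2
    model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
    audio = np.zeros(0, dtype=np.float32)
    prev_words, committed = [], []

    with sd.InputStream(samplerate=SR, channels=1, dtype="float32") as mic:
        while True:
            chunk, _ = mic.read(int(SR * CHUNK_SEC))
            audio = np.concatenate([audio, chunk[:, 0]])

            segments, _ = model.transcribe(audio, beam_size=1)
            words = " ".join(s.text for s in segments).split()

            # Longest common prefix of this and the previous hypothesis.
            agree = 0
            for a, b in zip(words, prev_words):
                if a != b:
                    break
                agree += 1
            if agree > len(committed):
                print(" ".join(words[len(committed):agree]), flush=True)
                committed = words[:agree]
            prev_words = words

In practice you'd also trim the buffer at silence so re-transcription cost doesn't grow without bound, but even this naive version shows why the chunk size, not inference time, dominates latency.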

In the use case listed above, the TTS probably isn't a bottleneck as long as the OP can generate tokens quickly.
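To make the "as long as tokens come fast" point concrete: you can overlap synthesis with generation by flushing each finished sentence straight to the TTS. Rough sketch, with a local Ollama server and Kokoro as example (not prescribed) choices:

    # Stream tokens from a local LLM and synthesize sentence-by-sentence,
    # so audio starts playing before the full reply has been generated.
    import json
    import numpy as np
    import requests
    import sounddevice as sd
    from kokoro import KPipeline

    tts = KPipeline(lang_code="a")

    def speak(sentence):
        for _, _, wav in tts(sentence, voice="af_heart"):
            sd.play(np.asarray(wav), samplerate=24000)
            sd.wait()

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:8b", "prompt": "Tell me a short story.", "stream": True},
        stream=True,
    )
    buf = ""
    for line in resp.iter_lines():
        if not line:
            continue
        piece = json.loads(line)
        buf += piece.get("response", "")
        # Flush on sentence boundaries so playback starts early.
        while any(p in buf for p in ".!?"):
            idx = min(buf.find(p) for p in ".!?" if p in buf)
            speak(buf[: idx + 1].strip())
            buf = buf[idx + 1:]
        if piece.get("done"):
            break
    if buf.strip():
        speak(buf.strip())

A real loop would synthesize and play on a separate thread or queue so generation keeps running while audio plays, but the idea is the same: time-to-first-audio is set by the first sentence, not the whole reply.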

All this being said, a wrapped model like this that can handle hand-offs between these parts of the process sounds really useful, and I'll definitely be interested to see how it performs.

Let me know if you guys play with this and find success.

dsrtslnd23 6 hours ago | parent | prev [-]

oh - very interesting indeed! thanks

marsbars241 2 hours ago | parent | prev | next [-]

Tangential: What hardware are you using for the interface on these? Is there a good array microphone that performs on par with Echos / Google Homes / HomePods?

amelius 4 hours ago | parent | prev | next [-]

For the TTS part: https://github.com/supertone-inc/supertonic

varik77 2 hours ago | parent | prev | next [-]

I have used https://github.com/SaynaAI/sayna . What I like most is that you can switch between providers easily and see what works best for you. It also supports local models.

dfajgljsldkjag 3 hours ago | parent | prev | next [-]

It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.

Not sure if there are any turnkey setups preconfigured for local install where you can just press play and go, though.

Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model; I'm assuming open source is a bit behind.

hedgehog an hour ago | parent | prev | next [-]

I haven't tried them myself, but Kyutai has a couple of projects that could fit.

https://kyutai.org

Johnny_Bonk 4 hours ago | parent | prev | next [-]

Anyone using any reasonably good small open-source speech-to-text models?

garblegarble 4 hours ago | parent [-]

For my inputs, Whisper distil-large-v3.5 is the best. I tried Parakeet 0.6B v3 last night, but it has higher error rates than I'd like (it is fast, though...)

Johnny_Bonk 4 hours ago | parent | next [-]

Nice, I'll try it. As of now, for my personal STT workflow I use the ElevenLabs API, which is pretty generous, but I'm curious to play around with other options.

garblegarble 4 hours ago | parent [-]

I assume that will be better than Whisper. I haven't benchmarked it against cloud models; the project I'm working on can't send data out to them.

BiraIgnacio 4 hours ago | parent | prev [-]

Oh, I've been looking into Whisper and Vosk over the last few days. I'll probably go with Whisper (via whisper.cpp), but has anyone compared it to the Vosk models?

DANmode 2 hours ago | parent | prev | next [-]

https://handy.computer got good marks from a very nontechnical user in my life this week!

Local, FOSS

jauntywundrkind 6 hours ago | parent | prev [-]

It was a little annoying getting the old Qt5 tools installed, but I really enjoyed using dsnote / Speech Note. Huge model selection for my AMD GPU. Good tool. I haven't dug in enough yet to suggest which model to go with. WhisperFlow is very popular.

Kyutai is always doing very interesting work. Their delayed-streams work is bleeding edge and sounds very promising, especially for low latency. Not sure why I haven't tried it yet, tbh. https://github.com/kyutai-labs/delayed-streams-modeling

There's also a really nice, elegant, simple app: Handy. It only supports Whisper and Parakeet V3, but it's a nice app and those are amazing models. https://github.com/cjpais/Handy