ljclifford 3 hours ago
actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day. the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works. the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro. btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
bachittle 31 minutes ago
Coqui TTS is actually deprecated; the company shut down. I have a voice assistant that uses gpt-5.4 and opus 4.6 via the subsidized plans from Codex and Claude Code, with STT and TTS from mlx-audio so those portions are locally hosted: https://github.com/Blaizzy/mlx-audio

Here are the models I found work well:

- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests, and the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default for STT.
- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default for TTS.
- Kitten TTS is good as a small model, better than Kokoro.
- Soprano TTS is surprisingly good for a small model, but it has glitches that prevent it from being my default.

But overall the mlx-audio library makes it really easy to try different models and see which ones I like.
buildsjets 9 minutes ago
Can you make it sound just like Titus Moody? I want to hear your voice assistant say "No sir, I don't hold with furniture that talks."
cdcarter 2 hours ago
80% of my home voice assistant requests really need no response other than an affirmative sound effect.
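A minimal sketch of that idea, not anyone's actual implementation: route each handled request to either a short chime or a spoken reply, so TTS only runs when there is something to say. The intent names, the `ok.wav` file name, and the `Reply` type are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical intent names; a real assistant would take these from its own
# intent parser or LLM tool calls.
CHIME_ONLY_INTENTS = {"light_on", "light_off", "set_timer", "set_volume"}


@dataclass
class Reply:
    kind: str     # "chime" or "speech"
    payload: str  # sound file name (assumed) or text to hand to the TTS engine


def choose_reply(intent: str, succeeded: bool, answer_text: str = "") -> Reply:
    """Pick the cheapest response that still tells the user what happened."""
    if not succeeded:
        # Failures get words; a chime alone would hide the error.
        return Reply("speech", answer_text or "Sorry, that didn't work.")
    if intent in CHIME_ONLY_INTENTS and not answer_text:
        # Simple commands with nothing to report just get an acknowledgement sound.
        return Reply("chime", "ok.wav")
    # Anything else (questions, commands that return information) is spoken.
    return Reply("speech", answer_text or "Done.")
```

For example, `choose_reply("light_on", True)` picks the chime, while `choose_reply("weather", True, "Sunny")` goes to speech. Keeping failures on the speech path is a design choice here: a silent or chime-only failure is harder to notice than a spoken one.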
cptskippy 2 hours ago
> actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.

I would argue that the hardest part is correctly recognizing that it's being addressed. 98% of my frustration with voice assistants is them not responding when spoken to. The other 2% is realizing I want them to stop talking.