zozbot234 5 hours ago
This is not voice-to-voice, though; end-to-end voice chat models (the Her UX) are completely different.
dust42 4 hours ago | parent
I haven't found any end-to-end voice chat models useful. I've had much better results with a separate STT-LLM-TTS pipeline. One big problem is turn detection; inference with 150-200ms latency would allow a whole new level of quality. I would use a small model with a prompt like "Do you think the user is finished talking?" and then push the utterance to a larger model. The AI should reply within the ballpark of 600-1000ms: faster is often irritating, slower makes the user start talking again.
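The gating logic described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual implementation: the silence threshold, the reply-delay window, and the `likely_finished` heuristic (a stand-in for the small "is the user done talking?" model) are all assumptions for the sketch.

```python
# Sketch of the two-stage turn-detection flow: a cheap check decides
# whether the user is likely finished, and only then is the transcript
# handed off to the (slower, more expensive) large model.

SILENCE_END_OF_TURN_S = 0.2   # ~150-200ms of silence triggers the cheap check
MIN_REPLY_DELAY_S = 0.6       # replying faster than this often feels irritating
MAX_REPLY_DELAY_S = 1.0       # slower, and the user starts talking again

def likely_finished(transcript: str) -> bool:
    """Cheap heuristic standing in for a small model prompted with
    'Do you think the user is finished talking?' (hypothetical)."""
    t = transcript.strip()
    return t.endswith((".", "?", "!")) or len(t.split()) > 12

def next_action(silence_s: float, transcript: str) -> str:
    """Decide what the pipeline does after `silence_s` seconds of silence."""
    if silence_s < SILENCE_END_OF_TURN_S:
        return "wait"    # user may still be mid-sentence
    if not likely_finished(transcript):
        return "wait"    # turn detector says: probably not done yet
    return "reply"       # hand off to the large model; target 600-1000ms total
```

In a real system the heuristic would be the small-model call, and the reply-delay window would be enforced by the scheduler that plays the TTS output.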