dust42 4 hours ago

I haven't found any end-to-end voice chat models useful. I had much better results with a separate STT-LLM-TTS pipeline. One big problem is turn detection: inference with 150-200ms latency would allow a whole new level of quality. I would just use a small model with a prompt like "Do you think the user is finished talking?" and then push the transcript to a larger model. The AI should reply within the ballpark of 600-1000ms. Faster is often irritating; slower makes the user start talking again.