zozbot234 5 hours ago
This is not voice-to-voice, though; end-to-end voice chat models (the Her UX) are completely different.
dust42 4 hours ago | parent
I haven't found any end-to-end voice chat models useful. I've had much better results with a separate STT-LLM-TTS pipeline. One big problem is turn detection; inference with 150-200ms latency would allow a whole new level of quality. I would use a small model with a prompt like "Do you think the user is finished talking?" and then push the utterance to a larger model. The AI should reply within the ballpark of 600-1000ms: faster is often irritating, slower makes the user start talking again.
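The gating logic described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual implementation: the silence threshold, the reply-delay window, and the `likely_finished` heuristic (a stand-in for the small "is the user done talking?" model) are all assumptions for the sketch.

```python
# Sketch of the two-stage turn-detection flow: a cheap check decides
# whether the user is likely finished, and only then is the transcript
# handed off to the (slower, more expensive) large model.

SILENCE_END_OF_TURN_S = 0.2   # ~150-200ms of silence triggers the cheap check
MIN_REPLY_DELAY_S = 0.6       # replying faster than this often feels irritating
MAX_REPLY_DELAY_S = 1.0       # slower, and the user starts talking again

def likely_finished(transcript: str) -> bool:
    """Cheap heuristic standing in for a small model prompted with
    'Do you think the user is finished talking?' (hypothetical)."""
    t = transcript.strip()
    return t.endswith((".", "?", "!")) or len(t.split()) > 12

def next_action(silence_s: float, transcript: str) -> str:
    """Decide what the pipeline does after `silence_s` seconds of silence."""
    if silence_s < SILENCE_END_OF_TURN_S:
        return "wait"    # user may still be mid-sentence
    if not likely_finished(transcript):
        return "wait"    # turn detector says: probably not done yet
    return "reply"       # hand off to the large model; target 600-1000ms total
```

In a real system the heuristic would be the small-model call, and the reply-delay window would be enforced by the scheduler that plays the TTS output.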