robrenaud 4 hours ago

What do y'all think about the latency/quality tradeoff with LLMs?

Human speakers don't take 30 seconds to think, retrieve, research, and summarize a high-quality answer. Humans are calibrated in their knowledge: they know what they understand and what they don't, and they can converse in real time without bullshitting.

Frontier real-time-ish LLM-generated voice systems are still plagued by 2024-era LLM nonsense, like the inability to count the Rs in "strawberry". [1]

I'd personally love a voice interface that, constrained by the technology of today, takes the latency hit to deliver quality.

[1] https://www.instagram.com/reel/DTYBpa7AHSJ/?igsh=MzRlODBiNWF...

navanchauhan 3 hours ago | parent [-]

Not affiliated with Sesame, but this is what the realtime models are trying to solve. If you look at NVIDIA’s PersonaPlex release [0], it uses a duplex architecture. It’s based on Moshi [1], which aims to address this problem by allowing the model to listen and generate audio at the same time.

[0] https://github.com/NVIDIA/personaplex

[1] https://arxiv.org/abs/2410.00037
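
The duplex idea above can be sketched in a few lines: instead of a turn-based loop (record the whole utterance, then think, then reply), the model consumes an input audio frame and emits an output frame on every tick. This is only an illustrative toy, not the PersonaPlex or Moshi API; the function names and frame representation are made up for the sketch.

```python
# Toy illustration of a full-duplex voice loop (in the spirit of Moshi):
# the model "listens" and "speaks" in the same tick, rather than waiting
# for the user to finish before it starts thinking. All names here are
# hypothetical stand-ins, not the real PersonaPlex/Moshi interfaces.

def toy_duplex_step(incoming_frame):
    """Stand-in for one model step: consume a frame, emit a frame."""
    if incoming_frame is None:
        return "silence"  # nothing heard this tick; model can still speak
    return f"response-to-{incoming_frame}"

def duplex_loop(mic_frames, ticks):
    """Interleave listening and speaking over a fixed number of ticks."""
    mic = iter(mic_frames)
    out = []
    for _ in range(ticks):
        frame = next(mic, None)             # non-blocking "listen"
        out.append(toy_duplex_step(frame))  # "speak" in the same tick
    return out

print(duplex_loop(["hello", "how", "are"], 5))
# ['response-to-hello', 'response-to-how', 'response-to-are', 'silence', 'silence']
```

The key property is that output latency is bounded by one tick, not by the length of the user's utterance; a half-duplex system can't begin responding (or back-channel) until the input stream ends.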