pothamk 8 hours ago

What’s interesting about full-duplex speech systems isn’t just the model itself, but the pipeline latency.

Even if each component is fast individually, the chain of audio capture → feature extraction → inference → decoding → synthesis can quickly add noticeable delay.

Getting that entire loop under ~200–300ms is usually what makes the interaction start to feel conversational instead of “assistant-like”.
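As a back-of-the-envelope sketch, summing hedged per-stage latencies shows how quickly the chain blows past that budget. All stage names and numbers below are illustrative assumptions, not measurements of any real system:

```python
# Hypothetical per-stage latencies (ms) for a cascaded voice pipeline.
# Every figure here is an illustrative assumption.
STAGES = {
    "audio_capture_buffer": 40,  # e.g. two 20 ms frames of input buffering
    "feature_extraction": 10,
    "asr_inference": 80,
    "llm_decoding": 90,
    "tts_synthesis": 60,
}

def total_latency(stages):
    """End-to-end delay of a cascaded pipeline is the sum of its stages."""
    return sum(stages.values())

BUDGET_MS = 250  # rough point where turn-taking starts to feel conversational
total = total_latency(STAGES)
print(f"total: {total} ms, over budget by {max(0, total - BUDGET_MS)} ms")
# → total: 280 ms, over budget by 30 ms
```

Note that each stage only needs to be modestly slow for the sum to exceed the budget, which is why shaving any single component rarely fixes the feel of the interaction.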

sigmoid10 8 hours ago

That's why this model, and all the other ones serious about realtime speech, don't use such a pipeline and instead process raw audio end-to-end.

exe34 8 hours ago

this is amazing - it reminds me of the time when LLM precursors were able to babble in coherent English, but would just write nonsense.