wild_egg 6 hours ago
The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet, but a 3s latency waiting on a reply seems rage-inducing if you're actually trying to accomplish something. Is that really where SOTA is right now?
dnackoul 3 hours ago
I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations. I attended VAPI Con earlier this year, and a lot of the discussion centered on how interruptions and turn detection are the next frontier in making voice agents smoother conversationalists. Knowing when to speak is a hard problem even for humans, but when you listen to a lot of voice agent calls, the friction point right now tends to be either interrupting too often or waiting too long to respond. The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.
russdill an hour ago
Been experimenting with having a local Home Assistant agent include a qwen 0.5B model to provide a quick response indicating that the agent is "thinking" about the request. It seems to work OK for the use case, but it feels like it'd get really repetitive in a two-way conversation. Another way to handle this would be to have the small model produce the first 3-5 words of a (non-committal) response and feed that in as part of the prompt to the larger model, so the full reply continues naturally from what was already spoken.
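A minimal sketch of that second idea, with stub functions standing in for the actual model calls (the function names, prompt format, and canned outputs here are all assumptions for illustration, not real APIs):

```python
# Latency-masking pattern: a tiny model emits a few non-committal opening
# words immediately; the big model is then prompted to continue from that
# exact prefix so the final utterance reads as one sentence.

def small_model_prefix(user_text: str) -> str:
    """Stand-in for a fast ~0.5B model that returns 3-5 opening words."""
    return "Sure, let me check"

def large_model_continue(user_text: str, prefix: str) -> str:
    """Stand-in for the slower large model, prompted with the prefix."""
    prompt = (
        f"User: {user_text}\n"
        f"Assistant (continue from this opening, don't repeat it): {prefix}"
    )
    # A real implementation would send `prompt` to the large model here.
    return " on that for you... done, the living room lights are off."

def respond(user_text: str) -> str:
    # 1. Speak the cheap prefix right away to hide the big model's latency.
    prefix = small_model_prefix(user_text)
    # 2. The big model completes the sentence the small model started.
    rest = large_model_continue(user_text, prefix)
    return prefix + rest
```

In a real pipeline the prefix would be handed to the TTS engine immediately while the large-model call runs in parallel, which is what buys back the perceived latency.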
duckkg5 5 hours ago
Absolutely not. 500-1000ms is borderline acceptable. Sub-300ms is closer to SOTA. 2000ms or more means people will hang up.
coderintherye 5 hours ago
Microsoft Foundry's realtime voice API (which itself wraps AI models from the major players) has response times in the hundreds of milliseconds.
echelon 2 hours ago
Sesame was the fastest model for a bit. Not sure what that team is doing anymore; they kind of went radio silent.
wellthisisgreat 5 hours ago
No, there are models with sub-second latency for sure.