Asterisk AI Voice Agent(github.com)
77 points by akrulino 7 hours ago | 37 comments
wild_egg 4 hours ago | parent | next [-]

The baseline configurations all note <2s and <3s times. I haven't tried any voice AI stuff yet but a 3s latency waiting on a reply seems rage inducing if you're actually trying to accomplish something.

Is that really where SOTA is right now?

dnackoul an hour ago | parent | next [-]

I've generally observed latency of 500ms to 1s with modern LLM-based voice agents making real calls. That's good enough to have real conversations.

I attended VAPI Con earlier this year, and a lot of the discussion centered on how interruptions and turn detection are the next frontier in making voice agents smoother conversationalists. Knowing when to speak is a hard problem even for humans, but when you listen to a lot of voice agent calls, the friction point right now tends to be either interrupting too often or waiting too long to respond.

The major players are clearly working on this. Deepgram announced a new SOTA (Flux) for turn detection at the conference. Feels like an area where we'll see even more progress in the next year.

duckkg5 3 hours ago | parent | prev | next [-]

Absolutely not.

500-1000ms is borderline acceptable.

Sub-300ms is closer to SOTA.

2000ms or more means people will hang up.

fragmede an hour ago | parent [-]

Play "Just a second, one moment please <sounds of typing>".wav as soon as input goes quiet.

The ChatGPT app has an audio version of the spinner icon when you ask it a question and it needs a second before answering.
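A minimal sketch of the filler idea in Python with asyncio, assuming the agent exposes some async way to generate a reply and to play a clip. The names `reply_with_filler`, `generate_reply`, `play`, and `filler_after_s` are hypothetical and not part of Asterisk or any voice framework. It is a slight variation on the suggestion above: instead of playing the filler the moment input goes quiet, it waits a short threshold and only plays the filler if the reply is not ready yet.

```python
import asyncio
from typing import Awaitable, Callable, List


async def reply_with_filler(
    generate_reply: Callable[[], Awaitable[str]],  # hypothetical: LLM + TTS call
    play: Callable[[str], Awaitable[None]],        # hypothetical: sends audio to the caller
    filler: str = "Just a second, one moment please",
    filler_after_s: float = 0.5,                   # assumed threshold, tune per deployment
) -> None:
    """Play a filler clip only if the real reply is slower than the threshold."""
    # Start generating the reply in the background.
    task = asyncio.ensure_future(generate_reply())
    # Wait up to the threshold; the task keeps running if we time out.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await play(filler)
    # Play the real reply once it is available.
    await play(await task)


def demo(reply_delay_s: float) -> List[str]:
    """Run one simulated turn and return the clips that were 'played'."""
    played: List[str] = []

    async def slow_reply() -> str:
        await asyncio.sleep(reply_delay_s)  # stands in for model latency
        return "Your order ships tomorrow."

    async def play(clip: str) -> None:
        played.append(clip)

    asyncio.run(reply_with_filler(slow_reply, play, filler_after_s=0.05))
    return played
```

Calling `demo(0.2)` simulates a slow turn and `demo(0.0)` a fast one. Gating on a threshold keeps fast turns from being padded with an unnecessary "one moment please".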

echelon 16 minutes ago | parent | prev | next [-]

Sesame was the fastest model for a bit. Not sure what that team is doing anymore, they kind of went radio silent.

https://app.sesame.com/

coderintherye 3 hours ago | parent | prev | next [-]

Microsoft Foundry's realtime voice API (which itself is wrapping AI models from the major players) has response times in the milliseconds.

wellthisisgreat 4 hours ago | parent | prev [-]

No, there are models with sub-second latency for sure

aftbit 4 hours ago | parent | prev | next [-]

This opens up new possibilities for interactive phone services. Retro-futuristic for sure.

looneysquash 4 hours ago | parent | prev | next [-]

That seems like bad news for Allison. Though I know she already had some TTS voices available, so maybe not.

eugene3306 4 hours ago | parent | prev | next [-]

I've created an Asterisk Codex Skill, but it turns out there is a ten-second timeout for scripts.

krater23 5 hours ago | parent | prev | next [-]

Please don't. I had a talk with a shitty AI bot on a Fedex line. It's absolute crap. Just give me a 'Type 1 for x, type 2 for y'. Then I don't need to guess what the possibilities are.

EvanAnderson 5 hours ago | parent | next [-]

Voice-controlled phone systems are hugely rage-inducing for me. I am often in loud settings with background chatter. Muting my audio and using a touchtone keypad is so much more accurate and easy than having to find a quiet place and worrying that somebody is going to say something that the voice response system detects.

9x39 4 hours ago | parent | prev | next [-]

One problem is once you’re in deep building a phone IVR workflow beyond X or Y (yes, these are intentional), callers don’t care about some deep and featured input menu. They just mash 0 or pick a random option and demand a human finish the job and transfer them - understandably.

When you’re committed to phone intent complexity (hell), the AI assisted options are sort of less bad since you don’t have to explain the menu to callers, they just make demands.

tartoran 4 hours ago | parent [-]

What if the goal is to keep gaslighting you until you give up your demands?

9x39 4 hours ago | parent | next [-]

Most voice agents for large companies are a calculated game to deter customers from expensive humans as we know, but not always.

Sort of like how Jira can be a streamlined tool or a prison of 50-step workflows, it's all up to the designer.

8note 2 hours ago | parent | prev [-]

you bought something from the wrong company, and you aren't gonna get helped by phone, bot, or person

cyberax 2 hours ago | parent | prev [-]

Well, the future is here: https://www.youtube.com/watch?v=HbDnxzrbxn4

nextworddev 5 hours ago | parent | prev | next [-]

Can I connect this to Twilio?

kwindla 4 hours ago | parent | next [-]

One easy way to build voice agents and connect them to Twilio is the Pipecat open source framework. Pipecat supports a wide variety of network transports, including the Twilio MediaStream WebSocket protocol so you don't have to bounce through a SIP server. Here's a getting started doc.[1]

(If you do need SIP, this Asterisk project looks really great.)

Pipecat has 90 or so integrations with all the models/services people use for voice AI these days. NVIDIA, AWS, all the foundation labs, all the voice AI labs, most of the video AI labs, and lots of other people use/contribute to Pipecat. And there's lots of interesting stuff in the ecosystem, like the open source, open data, open training code Smart Turn audio turn detection model [2], and the Pipecat Flows state machine library [3].

[1] - https://docs.pipecat.ai/guides/telephony/twilio-websockets [2] - https://github.com/pipecat-ai/smart-turn [3] - https://github.com/pipecat-ai/pipecat-flows/

Disclaimer: I spend a lot of my time working on Pipecat. Also writing about both voice AI in general and Pipecat in particular. For example: https://voiceaiandvoiceagents.com/

ldenoue 3 hours ago | parent | next [-]

The problem with Pipecat and LiveKit (the 2 major stacks for building voice AI) is the deployment at scale.

That’s why I created a stack entirely in Cloudflare workers and durable objects in JavaScript.

Providers like AssemblyAI and Deepgram now integrate VAD in their realtime API so our voice AI only needs networking (no CPU anymore).

nextworddev 2 hours ago | parent [-]

let me get this straight, you are storing convo threads / context in DOs?

e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?

nextworddev 3 hours ago | parent | prev [-]

This is good stuff.

In your opinion, how close is Pipecat + OSS to replacing proprietary infra from Vapi, Retell, Sierra, etc?

ldenoue 3 hours ago | parent | prev | next [-]

I developed a stack on Cloudflare workers where latency is super low and it is cheap to run at scale thanks to Cloudflare pricing.

Runs at around 50 cents per hour using AssemblyAI or Deepgram as the STT, Gemini Flash as the LLM and InWorld.ai as the TTS (for me it’s on par with ElevenLabs and super fast).
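To put that figure in per-call terms, here is a quick back-of-the-envelope calculation in Python. The only input taken from the comment is the roughly 50 cents per hour; the function name `call_cost_usd` and the example call lengths are illustrative assumptions, and real bills depend on each provider's actual pricing.

```python
HOURLY_COST_USD = 0.50  # "around 50 cents per hour", the only figure from the comment


def call_cost_usd(minutes: float, hourly_cost_usd: float = HOURLY_COST_USD) -> float:
    """Approximate cost of a call of the given length at a flat hourly rate."""
    return hourly_cost_usd * minutes / 60.0


if __name__ == "__main__":
    # Illustrative call lengths, not measurements.
    for minutes in (3, 10, 60):
        print(f"{minutes:>3} min call: about ${call_cost_usd(minutes):.4f}")
```

At that rate a 3-minute call works out to about 2.5 cents, assuming cost scales linearly with call time.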

pugio 2 hours ago | parent [-]

Do you have anything written up about how you're doing this? Curious to learn more...

VladVladikoff 5 hours ago | parent | prev [-]

Technically yes, Twilio has SIP trunks.

johnebgd 6 hours ago | parent | prev [-]

I welcome the spam calls from our asterisk overlords.

haroldp an hour ago | parent | next [-]

I was more thinking I could add it to my Asterisk server to honey-pot the spam callers into an infinite time waster cycle.

VladVladikoff 5 hours ago | parent | prev [-]

I’m honestly surprised it hasn’t been more prevalent yet. I still get call centre type spam calls where you can hear all the background noise of the rest of the call centre.

userbinator 4 hours ago | parent [-]

Is the background noise real, or is it also AI-generated to make you think that it's a human?

tartoran 4 hours ago | parent [-]

The background noise is a recording for sure. No AI needed, just a background noise audio file in a loop would do.

VladVladikoff 4 hours ago | parent [-]

Why though? It adds nothing positive, it only makes me sure it is a scam call.

the_af 3 hours ago | parent [-]

I assume it's to make it seem like an actual call center rather than a scam. I recently got two phone scam attempts (credit card related) that sounded exactly like this.

ldenoue 2 hours ago | parent | next [-]

I built a voice AI stack and background noise can be really helpful to a restaurant AI for example. Italian background music or cafe background is part of the brand. It’s not meant to make the caller believe this is not a bot but only to make the AI call on brand.

grim_io an hour ago | parent [-]

You can call it whatever you like, but to me this is deceptive.

Where is the difference between this and Indian support staff pretending to be in your vicinity by telling you about the local weather? Your version is arguably even worse because it can plausibly fool people more competently.

SoftTalker 2 hours ago | parent | prev [-]

you actually answer unknown callers?

Loughla 2 hours ago | parent | next [-]

Yes. I own a business.

the_af 2 hours ago | parent | prev [-]

Yes. Sometimes it's a legit call. Not often, though.

Example of legit calls: the pizza delivery guy decided to call my phone instead of ringing the bell, for whatever reason.