I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my blackwell device - I will try this on Mac now as well.

There are a few caveats here, for those of you venturing in this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).

Another aspect, after talking to peeps on PersonaPlex, is that this full duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite diffiult to train. On the other hand ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API based endpoints.

▲

nowittyusername 7 hours ago | parent | next [-]

I've been working on building my own voice agent as well for a while and would love to talk to you and swap notes if you have the time. I have many things id like to discuss, but mainly right now im trying to figure out how a full duplex pipeline like this could fit in to an agentic framework. Ive had no issues with the traditional route of stt > llm > tts pipeline as that naturally lends itself with any agentic behavior like tool use, advanced context managemnt systems, rag , etc... I separate the human facing agent from the subagent to reduce latency and context bloat and it works well. While I am happy with the current pipeline I do always keep an eye out for full duplex solutions as they look interesting and feel more dynamic naturally because of the architecture, but every time i visit them i cant wrap my head how you would even begin to implement that as part of a voice agent. I mean sure you have text input and output channels in some of these things but even then with its own context limitations feels like they could never bee anything then a fancy mouthpiece. But this feels like im possibly looking at this from ignorance. anyways would love to talk on discord with a like minded fella. cheers.

	▲	ilaksh 5 hours ago \| parent [-]
		For my framework, since I am using it for outgoing calls, what I am thinking maybe is I will add a tool command call_full_duplex(number, persona_name) that will get personaplex warmed up and connected and then pause the streams, then connect the SIP and attach the IO audio streams to the call and return to the agent. Then send the deepgram and personaplex text in as messages during the conversation and tell it to call a hangup() command when personaplex says goodbye or gets off track, otherwise just wait(). It could also use speak() commands to take over with TTS if necessary maybe with a shutup() command first. Need a very fast and smart model for the agent monitoring the call.

▲

scotty79 an hour ago | parent | prev [-]

I got PersonaPlex to run on my laptop (a beefy one) just by following the step by step instruction on their github repo.

The uncanny thing is that it reacts to speech faster than a person would. It doesn't say useful stuff and there's no clear path to plugging it into smarter models, but it's worth experiencing.