Show HN: I built a sub-500ms latency voice agent from scratch (ntik.me)
99 points by nicktikhonov 3 hours ago | 24 comments
I built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That’s with full STT → LLM → TTS in the loop, clean barge-ins, and no precomputed responses.

What moved the needle:

Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.

The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.

STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.

TTFT dominates everything. In voice, the first token is the critical path. Groq’s ~80ms TTFT was the single biggest win.

Geography matters more than prompts. Colocate everything or you lose before you start.

GitHub Repo: https://github.com/NickTikhonov/shuo

Follow whatever I next tinker with: https://x.com/nick_tikhonov
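For readers who want the "one loop, two transitions" idea in concrete form, here is a minimal sketch in Python with asyncio. It is not taken from the linked repo: the `TurnLoop` class, the `respond` callback, and the two event-handler names are hypothetical, and end-of-turn and barge-in detection are assumed to happen elsewhere and simply call into it.

```python
import asyncio
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class TurnLoop:
    """Hypothetical two-state turn-taking loop (illustrative only)."""

    def __init__(self, respond):
        # `respond` is an assumed coroutine function: transcript -> None.
        # In a real agent it would stream LLM tokens into TTS audio.
        self._respond = respond
        self._task = None
        self.state = State.LISTENING

    async def on_end_of_turn(self, transcript):
        # Transition 1: the user finished a turn, so start responding at once.
        if self.state is State.LISTENING:
            self.state = State.SPEAKING
            self._task = asyncio.create_task(self._run(transcript))

    async def on_barge_in(self):
        # Transition 2: the user spoke over the agent, so cancel at once.
        task = self._task
        if self.state is State.SPEAKING and task is not None:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: we cancelled the response ourselves
            self._task = None
            self.state = State.LISTENING

    async def _run(self, transcript):
        try:
            await self._respond(transcript)
        finally:
            # Whether the response finished or was cancelled, listen again.
            self.state = State.LISTENING
            self._task = None
```

Keeping the response in a single cancellable task is one way to make barge-in a one-line operation; anything streamed inside `respond` stops at its next `await`.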
armcat an hour ago
This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI introduced WebSockets in their Responses client recently, so it should be a bit faster. An alternative is to have a super small LLM running locally on your device. I built my own pipeline fully local and it was sub-second RTT, with no streaming or optimisations: https://github.com/acatovic/ova
age123456gpg an hour ago
Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline. I am using it daily to drive Claude and it works really well for me (much better than macOS dictation mode).
modeless an hour ago
IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.
NickNaraghi 3 hours ago
Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively. Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.
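As an illustration of the warm-pool idea mentioned above, here is a rough Python sketch of keeping connections open ahead of time so a turn never pays connection setup on the critical path. It is an assumption-laden sketch, not the author's implementation: `WarmPool` and the `connect` coroutine are hypothetical names, and `connect` stands in for whatever opens a TTS websocket in a real system.

```python
import asyncio


class WarmPool:
    """Hypothetical pool that keeps `size` connections opened in advance."""

    def __init__(self, connect, size=2):
        # `connect` is an assumed coroutine function that opens one connection.
        self._connect = connect
        self._size = size
        self._ready = asyncio.Queue()
        self._refills = set()  # keep references so refill tasks are not garbage-collected

    async def start(self):
        # Pay the connection cost up front, before any user turn arrives.
        for _ in range(self._size):
            await self._ready.put(await self._connect())

    async def acquire(self):
        # Hand out an already-open connection, then top the pool back up
        # in the background so the next turn also finds a warm one.
        conn = await self._ready.get()
        task = asyncio.create_task(self._refill())
        self._refills.add(task)
        task.add_done_callback(self._refills.discard)
        return conn

    async def _refill(self):
        await self._ready.put(await self._connect())
```

A real pool would also need health checks and a policy for stale or closed connections, which this sketch leaves out.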
lukax 2 hours ago
Or you could use Soniox Real-time (supports 60 languages), which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD. https://soniox.com/docs/stt/rt/endpoint-detection

Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat. https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

You can try a demo on the home page:

Disclaimer: I used to work for Soniox

Edit: I commented too soon. I only saw VAD and immediately thought of Soniox, which was the first service to implement real-time endpoint detection last year.
docheinestages an hour ago
Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?
boznz 2 hours ago
"Voice is an orchestration problem" is basically correct. The two takeaways from this for me are: 1. I wonder if it could be optimised more by just having a single language, and 2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had much success with voice in noisy environments.
loevborg 2 hours ago
Nice write-up, thanks for sharing. How does your hand-vibed Python program compare to frameworks like Pipecat or LiveKit Agents? Both are also written in Python.
perelin 2 hours ago
Great write-up! For VAD, did you use a headphone/mic combo or an open mic? If open, how did you deal with the agent interrupting itself?
MbBrainz 3 hours ago
Love it! Solving the latency problem is essential to making voice AI usable and comfortable. Your point on VAD is interesting - hadn't thought about that.
shubh-chat 23 minutes ago
This is superb, Nick! Thanks for this. Will try it out at some point for a project I am trying to build.
CagedJean an hour ago
Do you have hot talk when you are alone in the shower with HER?
jangletown 2 hours ago
impressive