armcat | 3 hours ago
This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI recently added WebSocket support to their Responses client, so it should be a bit faster. An alternative is to run a very small LLM locally on your device. I built my own fully local pipeline and got sub-second RTT, with no streaming or optimisations: https://github.com/acatovic/ova
nicktikhonov | 3 hours ago | parent
Very cool! Starred and added to my reading list. Would love to chat and share notes, if you'd like.