modeless 3 hours ago

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy. But I concluded that making something worth using for real tasks would require training end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

donpark 8 minutes ago | parent | next [-]

But I've read somewhere that the KV cache for speech-to-speech models explodes in size with each turn, which could make on-device full-duplex S2S unusable for anything beyond quick chats.
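A rough back-of-envelope illustrates the concern. All model dimensions and the audio-token rate below are hypothetical placeholders (not taken from any specific S2S model), but they show how per-token cache cost compounds when every second of a conversation adds tokens:

```python
# Illustrative KV-cache growth for a hypothetical full-duplex speech model.
# All parameters are assumptions for the sketch, not real model specs.
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values, cached at every transformer layer
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

FRAMES_PER_SEC = 12.5  # assumed audio-token rate for the sketch

for minutes in (1, 5, 30):
    tokens = int(minutes * 60 * FRAMES_PER_SEC)
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{minutes:>3} min of audio ~ {tokens} tokens ~ {gib:.2f} GiB of KV cache")
```

With these made-up numbers each audio token costs 128 KiB of cache, so a 30-minute conversation is already a couple of GiB on top of the weights, which is exactly the on-device squeeze being described.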

nicktikhonov 3 hours ago | parent | prev | next [-]

If you're of that opinion, you'll enjoy the new stuff coming out from nvidia:

https://research.nvidia.com/labs/adlr/personaplex/

woodson 2 hours ago | parent [-]

You mean Moshi (https://github.com/kyutai-labs/moshi)? Since Personaplex is just a finetuned Moshi model.

mountainriver 2 hours ago | parent [-]

Yeah, except Moshi doesn't sound good at all

com2kid 40 minutes ago | parent | prev [-]

The advantage is being able to plug in new models to each piece of the pipeline.

Is it super sexy? No. But each type of model is developing at a different rate (TTS moves really fast, low-latency STT/ASR moves slower, LLMs move at a pretty good pace).
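The swap-any-stage argument can be sketched as a pipeline against three narrow interfaces. Everything here is a stub of my own invention (the `STT`/`LLM`/`TTS` protocols and class names are not from any real library), just to show why upgrading one stage doesn't touch the others:

```python
from typing import Protocol

# Hypothetical stage interfaces; any real STT/LLM/TTS backend
# could be dropped in behind these without changing the pipeline.
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Wires the three stages; each is independently replaceable."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        text = self.stt.transcribe(audio_in)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)

# Toy stand-ins so the wiring runs end to end:
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

pipe = VoicePipeline(EchoSTT(), UpperLLM(), BytesTTS())
print(pipe.turn(b"hello"))  # b'HELLO'
```

Swapping in next month's better TTS means replacing one constructor argument; an end-to-end model gives up exactly that freedom.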