modeless 3 hours ago
IMO STT -> LLM -> TTS is a dead end; the future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU (https://github.com/jdarpinian/chirpy), but I concluded that making something worth using for real tasks would require training end-to-end models. It's a really interesting problem I would love to tackle, but it's out of my budget for a side project.
donpark 8 minutes ago
But I've read somewhere that the KV cache of a speech-to-speech model grows with every turn, which could make on-device full-duplex S2S unusable for anything beyond quick chats.
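A rough back-of-envelope sketch of that growth (all model dimensions, token rates, and turn lengths below are assumed for illustration, not taken from any specific S2S model):

```python
# Estimate KV cache growth for a hypothetical full-duplex speech-to-speech model.
# Every parameter here is an assumption for illustration.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per cached token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

TOKENS_PER_SECOND = 12.5  # assumed audio-token rate per speaker
SECONDS_PER_TURN = 20     # assumed average turn length

for turns in (1, 10, 50):
    # Full duplex caches both speakers' audio tokens, so the sequence
    # length (and the cache) grows linearly with accumulated turns.
    seq_len = int(2 * TOKENS_PER_SECOND * SECONDS_PER_TURN * turns)
    print(f"after {turns:3d} turns: {kv_cache_bytes(seq_len) / 2**30:.2f} GiB")
```

Under these assumptions the cache passes 3 GiB after 50 turns, which illustrates why a long on-device conversation could run out of memory even when a single short exchange fits easily.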
nicktikhonov 3 hours ago
If you're of that opinion, you'll enjoy the new stuff coming out from Nvidia:
com2kid 40 minutes ago
The advantage is being able to plug new models into each piece of the pipeline. Is it super sexy? No. But each type of model is developing at a different rate: TTS moves really fast, low-latency STT/ASR moves slower, and LLMs improve at a pretty good pace.
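That swappability is easy to see in code. A minimal sketch, where every class and method name is hypothetical and each stage sits behind its own interface so it can be upgraded independently:

```python
# Pluggable STT -> LLM -> TTS pipeline sketch; all names are hypothetical.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Chains the three stages; any one can be swapped without touching the others."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def run(self, audio_in: bytes) -> bytes:
        return self.tts.synthesize(self.llm.reply(self.stt.transcribe(audio_in)))

# Trivial stand-in stages, just to show the seams:
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

pipeline = VoicePipeline(EchoSTT(), UpperLLM(), BytesTTS())
print(pipeline.run(b"hello"))  # b'HELLO'
```

When a faster TTS model ships, only the `TTS` implementation changes; an end-to-end model has no such seams, which is the trade-off being debated above.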