com2kid 3 days ago
> they seem to be producing text, which is then put through text-to-speech (using what appeared to be a voice trained on their own -- nice touch).

This is an LLM thing. Plenty of open-source (or at least MIT-licensed) LLMs and TTS models exist that can translate and do zero-shot voice cloning from a short sample of the user's speech.

Direct audio-to-audio models tend to be less researched and less advanced than the corresponding (but higher-latency) audio-to-text-to-audio pipelines. That said, you can get audio -> text -> audio down to 400 ms or so of latency if you are really damn good at it.
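For anyone curious, here's a minimal sketch of that audio -> text -> audio pipeline built from off-the-shelf open-source pieces: faster-whisper for speech-to-text (Whisper can also translate to English directly), and Coqui XTTS v2 for zero-shot voice cloning from a reference clip. Model names, file paths, and parameters are just illustrative; this is not the setup from the demo being discussed.

```python
# Rough sketch of an audio -> text -> audio pipeline (not latency-optimized):
# speech-to-text + translation with faster-whisper, then zero-shot voice
# cloning with Coqui XTTS v2. Paths and model sizes are placeholders.
from faster_whisper import WhisperModel
from TTS.api import TTS

def translate_and_speak(input_wav: str, voice_reference_wav: str, output_wav: str) -> str:
    # 1) Speech-to-text. task="translate" makes Whisper emit English text
    #    regardless of the input language.
    stt = WhisperModel("small", device="cpu", compute_type="int8")
    segments, _info = stt.transcribe(input_wav, task="translate")
    text = " ".join(seg.text.strip() for seg in segments)

    # 2) Text-to-speech with zero-shot voice cloning: XTTS conditions on a
    #    few seconds of the speaker's own voice.
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav=voice_reference_wav,  # short clip of the user's voice
        language="en",
        file_path=output_wav,
    )
    return text

if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    spoken = translate_and_speak("utterance.wav", "my_voice_sample.wav", "reply.wav")
    print(spoken)
```

Batch, file-to-file processing like this is nowhere near 400 ms; hitting that number means streaming audio in chunks, emitting partial ASR results, and starting TTS synthesis before the sentence is finished.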