sosodev · 2 days ago
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests, though that kind of blurs the line between a native voice model and not. You would need:

* An STT (ASR) model that outputs phonetics, not just words
* An LLM fine-tuned to understand those phonetics and also output the proper tokens for prosody control, non-speech vocalizations, etc.
* A TTS model that understands those tokens and properly generates the matching voice

At that point I would probably argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice of something like 4o. The latency would likely be quite high though; a rough sketch of the wiring is below. I'm pretty sure I've seen a couple of open source projects that have done this type of setup, but I've not tried testing them.
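To make the three-stage idea concrete, here's a minimal sketch in Scala (picking it since the reply below mentions Cats Effect). Every type and method name here is hypothetical, invented for illustration, not any real model's API:

```scala
// Hypothetical sketch of the three-stage pipeline described above.
// All type and method names are illustrative stand-ins, not a real API.

// Stage 1: ASR that emits phonetic detail alongside the transcript.
final case class PhoneticTranscript(words: String, phonemes: Vector[String])
trait PhoneticAsr { def transcribe(audio: Array[Byte]): PhoneticTranscript }

// Stage 2: LLM fine-tuned to read the phonetics and emit prosody-control
// tokens (e.g. <pitch:+2>, <laugh>) interleaved with its reply text.
trait ProsodyAwareLlm { def reply(input: PhoneticTranscript): String }

// Stage 3: TTS that understands those tokens and renders matching audio.
trait ControllableTts { def synthesize(markedUpText: String): Array[Byte] }

// Naive end-to-end wiring. Each hop adds latency, which is why this
// cascade is expected to be slower than a native voice-to-voice model.
def voicePipeline(asr: PhoneticAsr, llm: ProsodyAwareLlm, tts: ControllableTts)(
    audioIn: Array[Byte]
): Array[Byte] =
  tts.synthesize(llm.reply(asr.transcribe(audioIn)))
```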
BoxOfRain · a day ago
I've been experimenting with something similar to this approach recently. IndexTTS2 takes emotion vectors as an input, so I used an external emotion classification model on the LLM output to modulate the TTS emotion vectors. You need to manage the state of the current affect with a bit of care or it sounds unhinged, but it's worked surprisingly well so far; a rough sketch follows below. I wired it together using Cats Effect. As you'd expect, latency isn't great, but I think it can be improved.
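For the curious, a minimal Cats Effect sketch of the affect-state smoothing I mean. `EmotionVector`, `classify`, and `synthesize` are hypothetical placeholders (the real classifier and IndexTTS2 calls live behind them); only `IO` and `Ref` are actual Cats Effect APIs:

```scala
import cats.effect.{IO, IOApp, Ref}

final case class EmotionVector(values: Vector[Double]) {
  // Blend toward the newly classified emotion instead of jumping to it,
  // so the affect drifts smoothly rather than swinging line to line.
  def blend(target: EmotionVector, alpha: Double): EmotionVector =
    EmotionVector(values.zip(target.values).map { case (cur, tgt) =>
      cur * (1 - alpha) + tgt * alpha
    })
}

// Placeholder for the external emotion classifier run on the LLM output.
def classify(llmOutput: String): IO[EmotionVector] =
  IO.pure(EmotionVector(Vector(0.1, 0.7, 0.2))) // e.g. neutral/happy/sad

// Placeholder for the TTS call that accepts emotion vectors.
def synthesize(text: String, emotion: EmotionVector): IO[Array[Byte]] =
  IO.pure(Array.emptyByteArray)

// Carry the current affect in a Ref and nudge it toward each new
// classification before synthesis.
def speak(state: Ref[IO, EmotionVector])(llmOutput: String): IO[Array[Byte]] =
  for {
    target  <- classify(llmOutput)
    emotion <- state.updateAndGet(_.blend(target, alpha = 0.3))
    audio   <- synthesize(llmOutput, emotion)
  } yield audio

object Demo extends IOApp.Simple {
  val run: IO[Unit] =
    Ref.of[IO, EmotionVector](EmotionVector(Vector(1.0, 0.0, 0.0)))
      .flatMap(state => speak(state)("Glad that worked!").void)
}
```

The `Ref` is what keeps the voice from sounding unhinged: without it, each utterance would snap to whatever the classifier says in isolation.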