lxgr 2 days ago:
Good idea, but an external "bolted on" LLM-based TTS would still pass that in many cases, right?
sosodev 2 days ago:
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests, though that kind of setup blurs the line between a native voice model and a cascade. You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that input and also to emit the proper tokens for prosody control, non-speech vocalizations, etc.

* A TTS model that understands those tokens and generates the matching voice (a rough sketch of the cascade follows below)

At that point I would argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice of something like 4o. The latency would likely be quite high, though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup, but I haven't tried testing them.
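A minimal sketch of that three-stage cascade, assuming all three models exist; every class, method, and control token here is a hypothetical stand-in, not a real library API:

    from dataclasses import dataclass, field

    @dataclass
    class RichTranscript:
        text: str                                    # the words themselves
        phonemes: str                                # e.g. IPA, so delivery survives ASR
        prosody: list = field(default_factory=list)  # pauses, pitch contour, emphasis

    class PhoneticASR:
        """Stage 1 (assumed): STT that emits phonetics, not just words."""
        def transcribe(self, audio: bytes) -> RichTranscript:
            raise NotImplementedError

    class ProsodyAwareLLM:
        """Stage 2 (assumed): LLM fine-tuned to read phonetic input and emit
        control tokens such as <laugh> or <pitch:+2> alongside its reply."""
        def generate(self, transcript: RichTranscript) -> str:
            raise NotImplementedError

    class ControlTokenTTS:
        """Stage 3 (assumed): TTS trained on the same control-token vocabulary."""
        def synthesize(self, annotated_text: str) -> bytes:
            raise NotImplementedError

    def voice_turn(audio: bytes, stt: PhoneticASR,
                   llm: ProsodyAwareLLM, tts: ControlTokenTTS) -> bytes:
        transcript = stt.transcribe(audio)   # keep how it was said, not just what
        reply = llm.generate(transcript)     # e.g. 'Ha! <laugh> Good one.'
        return tts.synthesize(reply)         # render the reply with matching affect

Note that each stage blocks on the previous one's full output, which is where the latency mentioned above comes from: nothing can be synthesized until the LLM has produced its annotated reply.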
| ||||||||
barrkel 2 days ago:
The model producing the text to be spoken would have to annotate that text for the TTS to add the affect. The TTS wouldn't "remember" such instructions from an earlier speech-to-text stage.
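For concreteness, one existing annotation format is SSML, a W3C markup standard that several TTS engines accept. The tags below are real SSML elements, though which ones a given engine honors varies; here the markup is just carried as a Python string:

    # The text model would emit markup like this for the TTS to interpret.
    annotated = (
        '<speak>'
        'Oh, <emphasis level="strong">really</emphasis>?'
        '<break time="400ms"/>'
        '<prosody pitch="+15%" rate="slow">I had no idea.</prosody>'
        '</speak>'
    )

Since the annotation lives in the text itself, it has to be regenerated on every turn; nothing about the user's original delivery persists unless the text model re-encodes it.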
jablongo 2 days ago:
I recently tried to make ChatGPT sing "Mary Had a Little Lamb"; the result is atonal but vaguely resembles the melody, which is interesting.