lxgr 2 days ago:
Good idea, but an external "bolted on" LLM-based TTS would still pass that in many cases, right?
sosodev 2 days ago:
Yes, a sufficiently advanced marriage of TTS and LLM could pass a lot of these tests, though that kind of setup blurs the line between a native voice model and a cascade. You would need:

* An STT (ASR) model that outputs phonetics, not just words

* An LLM fine-tuned to understand that input and also to emit the proper tokens for prosody control, non-speech vocalizations, etc.

* A TTS model that understands those tokens and generates the matching voice (a rough sketch of the cascade follows below)

At that point I would argue that you've created a native voice model, even if it's still less nuanced than the proper voice-to-voice of something like 4o. The latency would likely be quite high, though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup, but I haven't tried testing them.
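A minimal sketch of that three-stage cascade, assuming all three models exist; every class, method, and control token here is a hypothetical stand-in, not a real library API:

    from dataclasses import dataclass, field

    @dataclass
    class RichTranscript:
        text: str                                    # the words themselves
        phonemes: str                                # e.g. IPA, so delivery survives ASR
        prosody: list = field(default_factory=list)  # pauses, pitch contour, emphasis

    class PhoneticASR:
        """Stage 1 (assumed): STT that emits phonetics, not just words."""
        def transcribe(self, audio: bytes) -> RichTranscript:
            raise NotImplementedError

    class ProsodyAwareLLM:
        """Stage 2 (assumed): LLM fine-tuned to read phonetic input and emit
        control tokens such as <laugh> or <pitch:+2> alongside its reply."""
        def generate(self, transcript: RichTranscript) -> str:
            raise NotImplementedError

    class ControlTokenTTS:
        """Stage 3 (assumed): TTS trained on the same control-token vocabulary."""
        def synthesize(self, annotated_text: str) -> bytes:
            raise NotImplementedError

    def voice_turn(audio: bytes, stt: PhoneticASR,
                   llm: ProsodyAwareLLM, tts: ControlTokenTTS) -> bytes:
        transcript = stt.transcribe(audio)   # keep how it was said, not just what
        reply = llm.generate(transcript)     # e.g. 'Ha! <laugh> Good one.'
        return tts.synthesize(reply)         # render the reply with matching affect

Note that each stage blocks on the previous one's full output, which is where the latency mentioned above comes from: nothing can be synthesized until the LLM has produced its annotated reply.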
| ||||||||
barrkel 2 days ago:
The model producing the text to be spoken would have to annotate that text for the TTS to add the affect. The TTS wouldn't "remember" such instructions from an earlier speech-to-text stage.
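For concreteness, one existing annotation format is SSML, a W3C markup standard that several TTS engines accept. The tags below are real SSML elements, though which ones a given engine honors varies; here the markup is just carried as a Python string:

    # The text model would emit markup like this for the TTS to interpret.
    annotated = (
        '<speak>'
        'Oh, <emphasis level="strong">really</emphasis>?'
        '<break time="400ms"/>'
        '<prosody pitch="+15%" rate="slow">I had no idea.</prosody>'
        '</speak>'
    )

Since the annotation lives in the text itself, it has to be regenerated on every turn; nothing about the user's original delivery persists unless the text model re-encodes it.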
jablongo 2 days ago:
I recently tried to make ChatGPT sing "Mary Had a Little Lamb"; the result is atonal but vaguely resembles the melody, which is interesting.