A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?

rohan_joshi 4 hours ago | parent | next [-]

Small models struggle with prosody due to limited capacity. This version does much better than the previous one and is the best among <25MB models. Kokoro is a really good model for its size; it's competitive on Artificial Analysis too. I think by the next release we should have something of Kokoro quality at a fifth of the size. Adding control for rhythm seems to be quite important too, and we should start looking at that for other languages.

	magicalhippo 25 minutes ago | parent [-]
		Listened to the video examples. They sounded very good, though the text wasn't terribly challenging. If only I could have that in Norwegian, my SO would be pleased. Also, I totally misremembered regarding Kokoro TTS. It's good, but it's not what was butchering Norwegian. Forgot which one I was thinking of; maybe it was the old VITS stuff Rhasspy uses. Point stands: the voice was good, but I could barely understand what was said.

soco 5 hours ago | parent | prev [-]

That, and English words in the middle of a phrase in another language confuse them a lot.

	rohan_joshi 4 hours ago | parent [-]
		Yes. The current release of our model is English-only, so other languages are not expected to perform well. We'll try to look out for this in our multilingual release.