That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

BoxOfRain 13 hours ago | parent | next [-]

I quite like IndexTTS2 personally. It does voice cloning and also lets you modulate emotion manually through emotion vectors, which I've found to be quite a powerful tool. It's not necessarily something everyone needs, but it's really cool technology in my opinion.

It's been particularly useful for a model orchestration project I've been working on. I have an external emotion classification model driving both the LLM's persona and the TTS output so they stay relatively consistent. The affect system also influences which memories are retrieved: it's more likely to retrieve 'memories' created in the current affect state. IndexTTS2 was pretty much the only TTS that gave the level of control I felt was necessary.
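
To make the affect-biased retrieval idea concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the actual project code: the `Memory` class, the `rank_memories` function, the `affect_weight` parameter, and the representation of affect as a small numeric vector are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    relevance: float         # hypothetical semantic relevance to the query, 0..1
    affect: Sequence[float]  # hypothetical affect vector stored when the memory was created


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_memories(
    memories: List[Memory],
    current_affect: Sequence[float],
    affect_weight: float = 0.5,
) -> List[Memory]:
    """Sort memories so those created in a similar affect state rank higher."""

    def score(m: Memory) -> float:
        # Blend plain relevance with similarity to the current affect state.
        return (1 - affect_weight) * m.relevance + affect_weight * cosine(m.affect, current_affect)

    return sorted(memories, key=score, reverse=True)
```

With two equally relevant memories, the one whose stored affect vector is closest to the current state comes first. Setting `affect_weight` to 0 falls back to plain relevance ranking.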

realityfactchex 6 hours ago | parent [-]

Wow, the IndexTTS2 demo is very good. Definitely going to check that out. Thanks. [0] https://indextts2.org

iLoveOncall 15 hours ago | parent | prev [-]

Chatterbox-TTS has MUCH, MUCH better output quality though. The output from Sopro TTS (based on the video embedded on GitHub) is absolutely terrible and completely unusable for any serious application, while Chatterbox has incredible outputs.

I have an RTX 5090, so not exactly what most consumers will have but still accessible, and Chatterbox is also very fast on it: around 2 seconds of audio per 1 second of generation.

Here's an example I just generated (first try, 22 seconds runtime, 14 seconds of generation): https://jumpshare.com/s/Vl92l7Rm0IhiIk0jGors

Here's another one, 20 seconds of generation, 30 seconds of runtime, which clones a voice from a YouTuber (I don't use it for nefarious reasons, it's just for the demo): https://jumpshare.com/s/Y61duHpqvkmNfKr4hGFs, with the original source for the voice: https://www.youtube.com/@ArbitorIan
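
For reference, the speed of the two samples can be worked out as a real-time factor (seconds of audio per second of generation). This is a rough sketch that assumes "runtime" means the length of the generated audio; the `realtime_factor` helper is a hypothetical name used only for illustration.

```python
def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# (audio length, generation time) for the two samples, in seconds.
# Assumption: "runtime" in the comment is the audio length.
samples = [(22.0, 14.0), (30.0, 20.0)]
factors = [realtime_factor(audio, gen) for audio, gen in samples]

for (audio, gen), factor in zip(samples, factors):
    print(f"{audio:.0f}s audio / {gen:.0f}s generation = {factor:.2f}x real time")
```

Under that reading, both samples land at roughly 1.5x real time, a bit under the ballpark figure of 2x.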

sammyyyyyyy 15 hours ago | parent | next [-]

You should try it! I wouldn’t say it’s the best, far from that. But also wouldn’t say it’s terrible. If you have a 5090, then yes, you can run much more powerful models in real time. Chatterbox is a great model though.

iLoveOncall 14 hours ago | parent [-]

> But also wouldn’t say it’s terrible.

But you included 3 samples on your GitHub video and they all sound extremely robotic and have very bad artifacts?

samuel-vitorino 14 hours ago | parent [-]

[dead]

kkzz99 15 hours ago | parent | prev [-]

I've been using Higgs-Audio for a while now as my primary TTS system. How would you say Chatterbox compares to it, if you have experience with it?

iLoveOncall 14 hours ago | parent [-]

I haven't used it. I compared Chatterbox with T5Gemma TTS, which came out recently, and Chatterbox is much better in all aspects, especially in voice cloning, where T5Gemma basically did not work.