The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it fails to reproduce the variation in intonation, so the result sounds like a monotonous reading of a boring text.

I presume this is due to using the base model, and not the one tuned for more expressiveness.

edit: Or more likely, the demo not exposing the expressiveness controls.

The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.

Also, without FlashAttention it was dog slow on my 5090, running at 0.3x realtime with just 30% GPU usage. Though I guess that's to be expected. I saw no significant difference in generation speed between the two models.
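For reference, "0.3x realtime" means the ratio of audio produced to the wall-clock time spent producing it. A minimal sketch of that arithmetic follows; the helper names `realtime_speed` and `wall_time_for` are hypothetical and not part of Qwen3-TTS:

```python
def realtime_speed(audio_seconds: float, wall_seconds: float) -> float:
    """Speed as a multiple of realtime: 1.0 means generation keeps pace with playback."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time_for(audio_seconds: float, speed: float) -> float:
    """Wall-clock seconds needed to generate audio_seconds of audio at a given speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed


# At the 0.3x figure quoted above, a 10 second clip takes roughly 33 seconds to generate.
print(f"{wall_time_for(10.0, 0.3):.1f} s")
```

Anything below 1.0 rules out live streaming, since playback would outrun generation.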

Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I have tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality.

▲ thedangler 4 hours ago

How did you do this locally? Tools? Language?

	▲ magicalhippo 32 minutes ago

		I just followed the Quickstart[1] in the GitHub repo, which was refreshingly straightforward. Using the pip package worked fine, as did installing the editable version from the git repository. Just install the CUDA version of PyTorch[2] first.

		The HF demo is very similar to the GitHub demo, so it's easy to try out:

		`qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000`

		I skipped FlashAttention since I'm on Windows and haven't gotten FlashAttention 2 to work there yet (I found some precompiled FA3 files[3], but Qwen3-TTS isn't FA3 compatible yet).

		[1]: https://github.com/QwenLM/Qwen3-TTS?tab=readme-ov-file#quick...
		[2]: https://pytorch.org/get-started/locally/
		[3]: https://windreamer.github.io/flash-attention3-wheels/
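Since the setup depends on having a CUDA build of PyTorch installed first, a quick sanity check before launching the demo can save some confusion. This is only a sketch: `describe_torch_cuda` is a hypothetical helper name, and it uses nothing from Qwen3-TTS itself, only standard PyTorch attributes.

```python
def describe_torch_cuda() -> str:
    """Report whether the installed PyTorch build can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    # torch.version.cuda is None when the wheel was not built against CUDA.
    if torch.version.cuda is None:
        return f"PyTorch {torch.__version__} is not a CUDA build"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} (CUDA {torch.version.cuda}) found no usable GPU"
    return f"PyTorch {torch.__version__} (CUDA {torch.version.cuda}) on {torch.cuda.get_device_name(0)}"


print(describe_torch_cuda())
```

If this reports a non-CUDA build, reinstalling PyTorch with the command generated at [2] is the likely fix.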

▲ dsrtslnd23 2 hours ago

Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices.

	▲ magicalhippo 36 minutes ago

		The demo uses 6 GB of dedicated VRAM on Windows, but keep in mind that's without FlashAttention. I expect it would drop a bit if I got that working. I haven't looked into the demo to see if it could be optimized, for example by moving certain parts to the CPU.
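As a rough lower bound for comparison, the weights alone can be estimated from the parameter count and the bytes per parameter. This sketch makes two assumptions: the parameter counts are the nominal figures from the model names, and the precision the demo loads the model in is not stated in this thread, so both half and full precision are shown. It ignores activations, attention caches, the audio codec, and the CUDA context, all of which add to the measured figure.

```python
GIB = 1024 ** 3


def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


# Nominal parameter counts taken from the model names (an assumption).
for name, n_params in [("0.6B", 0.6e9), ("1.7B", 1.7e9)]:
    # 2 bytes per parameter for bf16/fp16, 4 bytes for fp32.
    for dtype, size in [("bf16/fp16", 2), ("fp32", 4)]:
        print(f"{name} @ {dtype}: {weights_gib(n_params, size):.1f} GiB")
```

Whatever remains between this estimate and the observed usage is runtime overhead, which is the part that FlashAttention or CPU offloading could plausibly reduce.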