gcr a day ago

Chatterbox TTS does this in “voice cloning” mode, but you have to implement the streaming part yourself.

There are two inputs: audio A (“style”) and B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B.
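For the streaming part, here is a rough sketch of one way to do it: cut B into overlapping chunks, convert each chunk against A, and crossfade the seams. Everything named here (`stream_convert`, `convert`, `passthrough`) is hypothetical and not Chatterbox's actual API. It also assumes the model returns as many samples as it was given, which a real model may not do.

```python
from typing import Callable, Iterator, List

Audio = List[float]  # mono samples; a real pipeline would likely use arrays


def stream_convert(
    content: Audio,                            # audio B: what is said
    style: Audio,                              # audio A: whose timbre to use
    convert: Callable[[Audio, Audio], Audio],  # hypothetical model call
    chunk: int = 4,
    overlap: int = 2,
) -> Iterator[Audio]:
    """Yield converted audio piece by piece instead of all at once."""
    if chunk <= 0 or overlap < 0 or 2 * overlap > chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap <= chunk / 2")
    hop = chunk - overlap
    tail: Audio = []  # held-back end of the previous chunk's output
    for start in range(0, len(content), hop):
        # Assumption: convert() returns one output sample per input sample.
        out = list(convert(content[start:start + chunk], style))
        # Linear crossfade from the previous chunk's tail into this chunk.
        n = min(len(tail), len(out))
        for i in range(n):
            w = (i + 1) / (n + 1)
            out[i] = tail[i] + (out[i] - tail[i]) * w
        if start + chunk >= len(content):
            yield out  # last chunk: nothing left to hold back
            return
        yield out[:hop]
        tail = out[hop:]


def passthrough(piece: Audio, style: Audio) -> Audio:
    """Stand-in for a real model: returns the content unchanged."""
    return list(piece)


content = [float(i) for i in range(10)]
pieces = list(stream_convert(content, [], passthrough))
```

In practice the chunks need to be long enough for the model to hear some prosody, so the tiny defaults above are only there to keep the example readable.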

Strictly speaking, voice cloning models like this and Chatterbox are not “TTS” - they’re better thought of as “S+STS”, that is, speech+style to speech.