derefr a day ago

Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

gcr a day ago | parent | next [-]

Chatterbox TTS does this in “voice cloning” mode but you have to implement the streaming part yourself.

There are two inputs: audio A (“style”) and B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B.

Strictly speaking, voice cloning models like this and Chatterbox are not “TTS”; they’re better thought of as “S+STS”, that is, speech+style to speech.
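
For the "implement the streaming part yourself" bit, here is a rough sketch of one way to do it: cut input B into overlapping windows, convert each window, and crossfade the seams. To be clear about assumptions: `convert` is a hypothetical stand-in for whatever model call you use (with the timbre from A already fixed), not Chatterbox's actual API, and it is assumed to return exactly as many samples as it receives. The `hop` and `overlap` values are illustrative only.

```python
from typing import Callable, Iterable, Iterator, List

Samples = List[float]


def _crossfade(old: Samples, new: Samples) -> Samples:
    """Linearly blend two equal-length overlaps, fading from old to new."""
    n = len(old)
    return [o * (1 - (i + 1) / (n + 1)) + x * ((i + 1) / (n + 1))
            for i, (o, x) in enumerate(zip(old, new))]


def stream_convert(
    samples: Iterable[float],
    convert: Callable[[Samples], Samples],  # hypothetical model call
    hop: int = 4,       # new samples emitted per step
    overlap: int = 2,   # extra look-ahead samples shared by adjacent windows
) -> Iterator[Samples]:
    """Yield converted chunks of a continuous stream.

    Assumes convert() returns the same number of samples it was given.
    """
    if not 0 <= overlap < hop:
        raise ValueError("overlap must satisfy 0 <= overlap < hop")
    window = hop + overlap
    buffer: Samples = []  # unprocessed input, starting at the next emit point
    tail: Samples = []    # converted look-ahead kept back from the last window

    for s in samples:
        buffer.append(s)
        if len(buffer) == window:
            out = convert(buffer)
            # Blend the seam with the previous window, then emit one hop.
            yield _crossfade(tail, out[:len(tail)]) + out[len(tail):hop]
            tail = out[hop:]
            buffer = buffer[hop:]

    # Flush whatever is left when the stream ends.
    if len(buffer) > len(tail):
        out = convert(buffer)
        yield _crossfade(tail, out[:len(tail)]) + out[len(tail):]
    elif tail:
        yield tail
```

Latency is roughly one window plus model time, so in practice you would tune the window size against how much context the model needs to sound stable.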

vunderba a day ago | parent | prev | next [-]

I don’t know about open models, but ElevenLabs has had this idea of mapping intonation/emotion/inflections onto a designated TTS voice for a while.

https://elevenlabs.io/blog/speech-to-speech

lumerios a day ago | parent | prev | next [-]

Yes, check out RVC (Retrieval-based Voice Conversion), which I believe is the only good open-source voice changer. Currently there's a bit of a conflict between the original creator and the current developers, so don't use the main fork. I think you'll be able to find a more up-to-date fork that's in English.

qingcharles a day ago | parent | prev [-]

There must be something out there that does this reliably as I often see/hear v-tubers doing it.