Remix clone Hacker News

	▲	derefr a day ago
		Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.) Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.
	▲	gcr a day ago
		Chatterbox TTS does this in “voice cloning” mode, but you have to implement the streaming part yourself. There are two inputs: audio A (“style”) and audio B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc. are taken from B. Strictly speaking, voice cloning models like this one and Chatterbox are not “TTS”; they’re better thought of as “S+STS”, that is, speech+style to speech.
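		For the "implement the streaming part yourself" bit, here is a rough sketch of one way it could look: cut the input into blocks, carry a little overlap forward for context, and crossfade the block boundaries. Everything here is an assumption for illustration; `convert` is a hypothetical stand-in for whatever model call you use (it is not Chatterbox's API), and it is assumed to return exactly as many samples as it receives.

```python
from typing import Callable, Iterable, Iterator, List

Samples = List[float]


def crossfade(tail: Samples, head: Samples) -> Samples:
    """Linearly blend a held-back converted tail into the start of the next block."""
    n = len(tail)
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight ramps from the old block to the new one
        blended.append(tail[i] * (1.0 - w) + head[i] * w)
    return blended


def stream_convert(
    blocks: Iterable[Samples],
    convert: Callable[[Samples], Samples],  # hypothetical model call, same-length output assumed
    overlap: int,
) -> Iterator[Samples]:
    """Convert a stream block by block, re-feeding `overlap` input samples as context."""
    prev_in: Samples = []    # last `overlap` input samples of the previous window
    prev_tail: Samples = []  # converted samples held back until the next block arrives
    for block in blocks:
        window = prev_in + block
        out = convert(window)
        if prev_tail:
            # The first len(prev_tail) outputs cover the same audio as prev_tail.
            k = len(prev_tail)
            out = crossfade(prev_tail, out[:k]) + out[k:]
        split = max(len(out) - overlap, 0)
        if split:
            yield out[:split]
        prev_tail = out[split:]
        prev_in = window[max(len(window) - overlap, 0):]
    if prev_tail:
        yield prev_tail  # flush whatever is still held back at end of stream
```

		The held-back tail adds `overlap` samples of latency, so for live use you would trade smoother seams against delay when picking that value.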
	▲	vunderba a day ago
		I don’t know about open models, but ElevenLabs has had this idea of mapping intonation/emotion/inflections onto a designated TTS voice for a while. https://elevenlabs.io/blog/speech-to-speech
	▲	lumerios a day ago
		Yes, check out RVC (Retrieval-based Voice Conversion), which I believe is the only good open source voice changer. Currently there's a bit of a conflict between the original creator and the current developers, so don't use the main fork. I think you'll be able to find a more up-to-date fork that's in English.
	▲	qingcharles a day ago
		There must be something out there that does this reliably as I often see/hear v-tubers doing it.