nateb2022 a day ago

> Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).

It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information."

> how would the model know what voice it should sound like

It uses the reference audio just like a text based model uses a prompt.

> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name

If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.

magicalhippo a day ago | parent [-]

With LLMs I've seen zero-shot used to describe scenarios where there's no example, e.g. "take this and output JSON", while one-shot has the prompt include an example, like "take this and output JSON; for this data the JSON should look like this".
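The prompting distinction described here can be sketched like this (the helper function and prompt wording are illustrative, not any real API):

```python
# Sketch of zero-shot vs. one-shot *prompting*: the only difference
# is whether in-context demonstrations are included in the prompt.
def build_prompt(task, query, examples):
    """examples == [] gives zero-shot; one (input, output) pair gives one-shot."""
    lines = [task]
    for inp, out in examples:  # demonstrations live in the context window
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

zero_shot = build_prompt("Convert to JSON.", "Ann is 30.", [])
one_shot = build_prompt(
    "Convert to JSON.",
    "Ann is 30.",
    [("Bob is 25.", '{"name": "Bob", "age": 25}')],
)
```

In neither case do the model's weights change; the "shots" are just lines of context.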

Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot.

However, it seems "zero-shot" in voice cloning is relative to learning, in contrast to one-shot learning[1].

So it's a bit of an overloaded term causing confusion, from what I can gather.

[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...

nateb2022 a day ago | parent [-]

The confusion clears up if you stop conflating contextual conditioning (prompting) with actual learning (weight updates). For LLMs, "few-shot prompting" is technically a misnomer that stuck; you are just establishing a pattern in the context window, not training the model.

In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold.
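The conditioning-vs-learning distinction can be sketched in a few lines of Python (all names here are hypothetical, not a real TTS API):

```python
# Minimal sketch: zero-shot conditioning uses the reference as an *input*
# with weights frozen; one-/few-shot *learning* would update the weights.
class TinyTTS:
    def __init__(self):
        self.weights = [0.1, 0.2]  # fixed after pretraining

    def synthesize(self, text, reference_audio):
        # Zero-shot inference: reference_audio conditions the output,
        # but self.weights are never touched.
        speaker_embedding = sum(reference_audio) / len(reference_audio)
        return [w * speaker_embedding for w in self.weights]

    def fine_tune(self, reference_audio):
        # One-shot *learning* would look like this instead: a gradient
        # step (crudely faked here) that changes the weights themselves.
        self.weights = [w + 0.01 for w in self.weights]

tts = TinyTTS()
before = list(tts.weights)
tts.synthesize("hello", reference_audio=[0.5, 0.7])
assert tts.weights == before  # conditioning leaves the weights unchanged
```

Calling `synthesize` with a never-before-seen reference voice is zero-shot in exactly the sense above: zero gradient updates, not zero input information.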