| ▲ | coder543 a day ago |
| Why wouldn’t that be one-shot voice cloning? The concept of calling it zero shot doesn’t really make sense to me. |
|
| ▲ | ben_w a day ago | parent | next [-] |
| Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?" As with other replies, yes this is a silly name. |
| |
| ▲ | nateb2022 12 hours ago | parent [-] | | > Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?" I would caution that using the term "example" suggests further learning happens at inference time, which isn't the case. For LLMs, the entire prompt is the input and conveys both the style and the content. In zero-shot voice cloning, we provide the exact same inputs, just decoupled. Providing reference audio is no different than including "Answer in the style of Sir Isaac Newton" in an LLM's prompt. The model doesn't 'learn' the voice; it simply applies the style to the content during the forward pass. |
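A minimal sketch of that point, assuming a hypothetical zero-shot TTS interface (the class and method names below are illustrative stand-ins, not any real library's API): the reference clip and the text are both just inputs to one forward pass, and no weights change.

    import numpy as np

    class ZeroShotTTS:
        """Stub standing in for a pretrained voice-cloning model with frozen weights."""
        def synthesize(self, text: str, reference_audio: np.ndarray) -> np.ndarray:
            # Both arguments feed a single forward pass: the reference clip
            # conditions the voice ("style"), the text supplies the words
            # ("content"). Nothing here updates any weights.
            return np.zeros(16000)  # placeholder waveform

    model = ZeroShotTTS()                       # weights fixed at load time
    reference = np.random.randn(3 * 16000)      # ~3 s clip of the target speaker
    wave = model.synthesize("Hello there.", reference_audio=reference)

Swapping in a different reference clip changes the voice of the next call without touching the model itself, which is the sense in which no "learning" happens.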
|
|
| ▲ | nateb2022 a day ago | parent | prev | next [-] |
| Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md file included in a prompt. You're not retraining the model; you're simply supplying additional context at inference time. If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning. |
| |
| ▲ | coder543 a day ago | parent [-] | | To me, a closer analogy is in-context learning (ICL). In the olden days of 2023, you didn’t just find instruct-tuned models sitting on every shelf. You could use a base model that had only undergone pretraining and could only generate text continuations based on the input it received. If you provided the model with several examples of a question followed by an answer, and then provided a new question followed by a blank for the next answer, the model understood from the context that it needed to answer the question. This is the most primitive use of ICL, and a very basic way to achieve limited instruction-following behavior. Since the prompt contains a few examples, I would call that few-shot ICL, not zero-shot, even though the model weights are locked. But I am learning that it is technically called zero-shot, and I will accept this, even if I think it is a confusingly named concept. |
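A rough sketch of that few-shot pattern as it was used with base completion models (only the prompt construction is shown; the actual model call is omitted):

    # Two worked Q/A examples establish the pattern in context ("few-shot"),
    # then a new question is left with a blank answer for the model to continue.
    examples = [
        ("What is the capital of France?", "Paris"),
        ("What is 2 + 2?", "4"),
    ]
    new_question = "Who wrote On the Origin of Species?"

    prompt = ""
    for question, answer in examples:
        prompt += f"Q: {question}\nA: {answer}\n\n"
    prompt += f"Q: {new_question}\nA:"

    print(prompt)

Fed to a base model as a plain continuation task, a prompt like this tends to elicit an answer in the established Q/A format; the behavior is steered entirely by the context, with no change to the weights.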
|
|
| ▲ | woodson a day ago | parent | prev | next [-] |
| I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway: how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name)? |
| |
| ▲ | nateb2022 a day ago | parent | next [-] | | > Zero-shot doesn't make sense anyway: how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name)? It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information." > how would the model know what voice it should sound like It uses the reference audio just like a text-based model uses a prompt. > unless it's a celebrity voice or similar included in the training data where it's enough to specify a name If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before. | | |
| ▲ | magicalhippo a day ago | parent [-] | | With LLMs I've seen zero-shot used to describe scenarios where there's no example, e.g. "take this and output JSON", while one-shot has the prompt include an example like "take this and output JSON, for this data the JSON should look like this". Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot. However, it seems the zero-shot in voice cloning is relative to learning, and in contrast to one-shot learning[1]. So it's a bit of an overloaded term causing confusion, from what I can gather. [1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi... | |
| ▲ | nateb2022 a day ago | parent [-] | | The confusion clears up if you stop conflating contextual conditioning (prompting) with actual Learning (weight updates). For LLMs, "few-shot prompting" is technically a misnomer that stuck; you are just establishing a pattern in the context window, not training the model. In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold. |
|
| |
|
|
| ▲ | geocar a day ago | parent | prev | next [-] |
| So if you get your target to record (say) 1 hour of audio, that's a one-shot. If you didn't do that (because you have 100 hours of other people talking), that's zero-shots, no? |
| |
| ▲ | nateb2022 a day ago | parent [-] | | > So if you get your target to record (say) 1 hour of audio, that's a one-shot. No, that would still be zero shot. Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md file included in a prompt. You're not retraining the model; you're simply supplying additional context at inference time. If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning. | | |
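To make the contrast concrete, here is an illustrative sketch under the same assumptions as the earlier stub (invented class and method names, not a real API): the zero-shot path only conditions on the clip, while the one-shot-learning path actually updates weights on it.

    import numpy as np

    class VoiceModel:
        """Stub voice-cloning model; a placeholder array stands in for its weights."""
        def __init__(self):
            self.weights = np.zeros(10)

        def synthesize(self, text: str, reference_audio: np.ndarray) -> np.ndarray:
            # Zero-shot: the clip only conditions this forward pass.
            return np.zeros(16000)  # placeholder waveform

        def fine_tune_step(self, clip: np.ndarray) -> None:
            # One-shot learning: a gradient step computed from the single clip
            # modifies the weights (shown here as a dummy update).
            self.weights += 0.01

    model = VoiceModel()
    clip = np.random.randn(16000)  # stand-in for the target speaker's recording

    # Zero-shot cloning: weights untouched, clip used purely as context.
    out = model.synthesize("Hello.", reference_audio=clip)

    # One-shot learning: optimize the weights against the clip, then synthesize.
    model.fine_tune_step(clip)
    out_tuned = model.synthesize("Hello.", reference_audio=clip)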
| ▲ | ImPostingOnHN 14 hours ago | parent [-] | | > Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc. The "shot" refers to how many examples are provided to the LLM in the prompt, and has nothing to do with training or tuning, in every context I've ever seen. | |
| ▲ | nateb2022 12 hours ago | parent [-] | | > Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc. > The "shot" refers to how many examples are provided to the LLM in the prompt, and has nothing to do with training or tuning, in every context I've ever seen. In formal ML, "shot" refers to the number of samples available for a specific class during the training phase. You're describing a colloquial usage of the term found only in prompt engineering. You can't apply an LLMism to a voice-cloning model where standard ML definitions apply. |
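For reference, a small illustration of the formal usage being described (the structure is the point; speaker names and file names are made up):

    # In an N-way K-shot setup, "shot" counts the labeled examples per class
    # that are available for the model to learn or adapt from.
    one_shot_support = {
        "speaker_A": ["speaker_A_clip1.wav"],   # K = 1 per class: one-shot
        "speaker_B": ["speaker_B_clip1.wav"],
    }

    # Zero-shot (K = 0): no examples of the new class are used for training or
    # adaptation at all; the model only sees query-time inputs, e.g. a reference
    # clip that conditions generation but never drives a weight update.
    zero_shot_support = {}

    query_clip = "unknown_speaker.wav"  # the input presented at inference time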
|
|
|
|
| ▲ | oofbey a day ago | parent | prev [-] |
| It’s nonsensical to call it “zero shot” when a sample of the voice is provided. The term “zero shot cloning” implies you have some representation of the voice from another domain - e.g. a text description of the voice. What they’re doing is ABSOLUTELY one shot cloning. I don’t care if lots of STT folks use the term this way, they’re wrong. |