geocar a day ago

So if you get your target to record (say) 1 hour of audio, that's a one-shot.

If you didn't do that (because you have 100 hours of other people talking), that's zero-shots, no?

nateb2022 a day ago | parent

> So if you get your target to record (say) 1 hour of audio, that's a one-shot.

No, that would still be zero-shot. Providing inference-time context (in this case, audio) is no different from giving a prompt to an LLM. Think of it as analogous to an AGENTS.md file included in a prompt: you're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
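The distinction can be sketched in a few lines of Python. This is a toy stand-in, not a real voice-cloning API; the `Model` class and method names are hypothetical, chosen only to show where the weights do and don't change:

```python
# Toy sketch (hypothetical API) contrasting inference-time context
# with one-shot fine-tuning on a single clip.

class Model:
    def __init__(self):
        self.weights = [0.0]  # pretrained parameters

    def generate(self, context):
        # Inference-time conditioning: the clip is just part of the
        # input; self.weights are never modified (zero-shot use).
        return f"output conditioned on {context}"

    def fine_tune(self, sample):
        # One-shot learning: the single sample actually updates the
        # model's weights before inference.
        self.weights[0] += 1.0
        return self

m = Model()
m.generate("reference_clip.wav")   # weights untouched: zero-shot
m.fine_tune("reference_clip.wav")  # weights changed: one-shot
```

The reference clip appears in both calls; what differs is whether the parameters are updated.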

ImPostingOnHN 14 hours ago | parent

> Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM.

Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.

The "shot" refers to how many examples are provided to the LLM in the prompt, and has nothing to do with training or tuning, in every context I've ever seen.
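Under this usage, the shot count is just the number of examples concatenated into the prompt. A minimal sketch (the `build_prompt` helper is illustrative, not any real library's API):

```python
# Minimal sketch: in prompt-engineering usage, "n-shot" means
# n examples are included in the prompt text itself.

def build_prompt(task: str, examples: list[str]) -> str:
    """Build an n-shot prompt, where n is simply len(examples)."""
    if not examples:
        return task  # zero-shot: the bare task, no examples
    return f"{task}, for example: " + "; ".join(examples)

zero_shot = build_prompt("give me a list of animals", [])
one_shot = build_prompt("give me a list of animals", ["a cat"])
two_shot = build_prompt("give me a list of animals", ["a cat", "a dog"])
```

No weights are touched anywhere in this construction; the model only ever sees the assembled string.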

nateb2022 12 hours ago | parent

> Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.

> The "shot" refers to how many examples are provided to the LLM in the prompt, and has nothing to do with training or tuning, in every context I've ever seen.

In formal ML, "shot" refers to the number of samples available for a specific class during the training phase. You're describing a colloquial usage of the term found only in prompt engineering.

You can't apply an LLMism to a voice-cloning model where the standard ML definitions apply.