nateb2022 a day ago

> Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

I would caution that using the term "example" suggests further learning happens at inference-time, which isn't the case.

For LLMs, the entire prompt is the input and conveys both the style and the content vectors. In zero-shot voice cloning, we provide the exact same input vectors, just decoupled. Providing reference audio is no different from including "Answer in the style of Sir Isaac Newton" in an LLM's prompt. The model doesn't 'learn' the voice; it simply applies the style vector to the content during the forward pass.
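
To make that concrete, here is a toy sketch in plain Python. It is my own illustration, not the API of any real TTS system: the weight matrices, `forward`, and the two-element vectors are all made up. The only point is that the style vector is just another input to a forward pass over frozen weights, so changing the "voice" changes the output without touching the model.

```python
# Toy illustration only: a "model" with frozen weights and two decoupled inputs.
# All names and numbers here are hypothetical, chosen to keep the example tiny.

# Weights fixed after training; nothing below ever writes to them.
W_CONTENT = [[0.5, -0.2], [0.1, 0.3]]
W_STYLE = [[0.4, 0.0], [-0.3, 0.2]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(content, style):
    """One forward pass: project content and style, then combine them."""
    projected_content = matvec(W_CONTENT, content)
    projected_style = matvec(W_STYLE, style)
    return [c + s for c, s in zip(projected_content, projected_style)]


# Snapshot the weights so we can check they are untouched afterwards.
weights_before = [row[:] for row in W_CONTENT + W_STYLE]

content = [1.0, 2.0]      # stands in for "what to say"
style_a = [1.0, 0.0]      # stands in for an embedding of reference voice A
style_b = [0.0, 1.0]      # stands in for an embedding of reference voice B

out_a = forward(content, style_a)
out_b = forward(content, style_b)

weights_after = [row[:] for row in W_CONTENT + W_STYLE]
```

Same content, two different style inputs, two different outputs, and `weights_before == weights_after` holds. Swapping the reference audio is the same kind of operation as swapping a line in the prompt.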