nateb2022 a day ago

Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.

coder543 a day ago

To me, a closer analogy is In Context Learning.

In the olden days of 2023, you didn’t just find instruct-tuned models sitting on every shelf.

You could use a base model that had only undergone pretraining and could only generate text continuations of whatever input it received. If you provided the model with several examples of a question followed by an answer, and then a new question followed by a blank for the next answer, the model inferred from the context that it should answer the question. This is the most primitive use of ICL, and a very basic way to achieve limited instruction-following behavior.
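A minimal sketch of what that prompt looks like (the questions and the helper function are my own illustration, not from any particular library; the point is the shape of the context a base model sees):

```python
# Few-shot prompting for a base (pretrained-only) model: several Q/A
# demonstrations, then a new question with an empty answer slot. A model
# trained only to continue text tends to fill in the blank after "A:"
# because the context establishes the pattern.

examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

def build_few_shot_prompt(examples, new_question):
    """Concatenate Q/A demonstrations, ending with a blank answer slot."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

The resulting string would then be fed to the base model as-is; no instruction tuning is involved, only the pattern set up by the examples in context.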

With those examples in the prompt, I would call that few-shot ICL. Not zero-shot, even though the model weights are frozen.

But I am learning that it is technically called zero-shot, and I will accept this, even if I think it is a confusingly named concept.