evrydayhustling 4 hours ago

An LLM itself -- that is, the weights and the mathematical functions linking them -- does not tell you exactly how to train it from data, nor how to generate an output. Instead, it describes a function giving the relative likelihood of each possible output given the input: likelihood(output | input).

Deciding how to pick a particular output given that likelihood function is left as an exercise for the user, which we call inference.
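To make that concrete, here is a toy sketch of the likelihood function over a made-up 4-token vocabulary. The logit values are hypothetical; a real model produces one logit per vocabulary entry, and softmax turns them into the relative-likelihood distribution the comment describes.

```python
import math

# Hypothetical logits the model might emit for the next token.
logits = {"the": 2.1, "cat": 1.3, "sat": 0.4, "mat": -0.5}

def softmax(logits):
    """Turn raw logits into a probability distribution over tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(logits)
# probs sums to 1; "the" gets the largest share, but every token keeps some mass,
# and it is up to the inference procedure to decide what to do with that.
```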

One obvious choice is to repeatedly pick the highest-likelihood token, feed it back into the model, and get another -- on repeat. This is what most implementations call "temperature=0" (greedy decoding). But doing this token after token can lead to boring output, or steer you into pathological sequences, like getting stuck in endless repeats.
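The repeat pathology can be shown with a stubbed-out model. `fake_model` below is not a real LLM -- it is a hypothetical stand-in that slightly favors repeating the previous token, which is enough to send greedy decoding into a loop:

```python
def fake_model(context):
    # Stand-in for a real model: returns made-up logits that mildly
    # favor repeating the most recent token in the context.
    last = context[-1]
    return {t: (2.0 if t == last else 0.5) for t in ["the", "cat", "sat"]}

def greedy_pick(logits):
    # "temperature=0": always take the single highest-logit token.
    return max(logits, key=logits.get)

context = ["the"]
for _ in range(5):
    context.append(greedy_pick(fake_model(context)))

print(context)  # -> ['the', 'the', 'the', 'the', 'the', 'the']
```

Once repetition is even slightly self-reinforcing, argmax never escapes it, because the sampler has no source of variation.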

So, the current SOTA is to intentionally introduce randomness into the sampling process (temperature>0) -- along with other hacks, like explicit suppression of repeats (repetition penalties).
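A minimal sketch of both tricks together, with made-up logits: dividing logits by a temperature below 1 sharpens the distribution (and above 1 flattens it), and the repeat-suppression scheme here -- scaling down the logits of recently emitted tokens -- is just one common variant of a repetition penalty, not the only way it is done.

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, recent=(), penalty=1.3, rng=random):
    adjusted = {}
    for tok, logit in logits.items():
        # Crude repetition suppression: make recently emitted tokens less likely.
        if tok in recent:
            logit = logit / penalty if logit > 0 else logit * penalty
        # Temperature scaling: T<1 sharpens the distribution, T>1 flattens it.
        adjusted[tok] = logit / temperature

    # Softmax over the adjusted logits (max-subtraction for stability).
    m = max(adjusted.values())
    exps = {t: math.exp(v - m) for t, v in adjusted.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}

    # Draw one token according to the adjusted probabilities.
    r = rng.random()
    cum = 0.0
    for tok, p in sorted(probs.items()):
        cum += p
        if r < cum:
            return tok
    return tok  # float-rounding fallback

# Usage: a very low temperature behaves almost exactly like greedy decoding,
# while higher temperatures let lower-likelihood tokens through.
pick = sample_with_temperature({"the": 2.1, "cat": 1.3, "sat": 0.4}, temperature=0.8)
```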