jldugger 2 days ago

Well, there are really two answers here:

1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature setting commonly exposed controls how strongly sampling favors the most likely candidate tokens (a small sketch of this follows after the list).

2. A Markov chain model will literally have a matrix entry for every possible combination of inputs. So an order-2 chain has a row for each of the n^2 possible two-token contexts, where n is the number of possible tokens. In that situation "pig with" can never be completed with a brand-new sentence, because unseen continuations have literal zeros in their probabilities. In contrast, transformers consider huge context windows and start with random weights in huge neural-network matrices. What people hope happens is that the network begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.
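
Here is the sketch for point 1: a toy version of temperature-scaled sampling. The logits are made up for illustration and not from any real model.

  import math, random

  # Toy next-token logits for the prefix "pig with" (numbers are made up).
  logits = {"a": 4.0, "mud": 3.5, "wings": 1.0, "dragon": -2.0}

  def sample(logits, temperature):
      # temperature -> 0 approaches greedy argmax; higher values flatten the
      # distribution so unlikely tokens like "dragon" get picked more often.
      if temperature == 0:
          return max(logits, key=logits.get)          # greedy decoding
      z = sum(math.exp(v / temperature) for v in logits.values())
      probs = {t: math.exp(v / temperature) / z for t, v in logits.items()}
      r, cum = random.random(), 0.0
      for token, p in probs.items():
          cum += p
          if r <= cum:
              return token
      return token

  print(sample(logits, 0))     # always "a"
  print(sample(logits, 1.0))   # usually "a" or "mud", very occasionally "dragon"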

thesz a day ago

  > A Markov chain model will literally have a matrix entry for every possible combination of inputs.
The less frequent prefixes are usually pruned away, and a backoff penalty is added when the model falls back to a shorter prefix. In the end, all words are included in the model's prediction, and a typical n-gram SRILM model is able to generate "the pig with dragon head", also with a small probability.

Even if you think of the Markov chain's parameters as a tensor (not a matrix), computing a probability is not a single lookup but a series of folds.
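
A rough sketch of that backoff idea (not actual SRILM code; the counts and penalty are invented, and this is stupid-backoff-style scoring rather than a properly normalized Katz or Kneser-Ney model): an unseen continuation like "dragon" after "pig with" still gets a small non-zero score by falling back to shorter prefixes.

  # Toy counts, invented for illustration: context tuple -> {next word: count}.
  counts = {
      ("pig", "with"): {"mud": 3, "a": 1},     # trigram contexts
      ("with",):       {"mud": 4, "a": 2},     # bigram contexts
      ():              {"mud": 5, "a": 3, "dragon": 1, "head": 1, "pig": 2, "with": 2},
  }

  BACKOFF = 0.4   # penalty factor applied each time we fall back to a shorter prefix

  def score_next(word, context):
      # Try the longest prefix first; if the word was never seen there, recurse
      # on a shorter prefix and multiply in the penalty. The recursion is the
      # "series of folds" mentioned above.
      table = counts.get(tuple(context), {})
      total = sum(table.values())
      if total and word in table:
          return table[word] / total
      if not context:              # nothing left to back off to
          return 1e-9              # tiny floor instead of a hard zero
      return BACKOFF * score_next(word, context[1:])

  print(score_next("mud",    ("pig", "with")))   # 0.75
  print(score_next("dragon", ("pig", "with")))   # 0.4 * 0.4 * (1/14), small but non-zero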

vrighter a day ago

A Markov chain model does not specify the implementation details of the function that takes a previous input (and only a previous input) and outputs a probability distribution. You could put all possible inputs into an LLM (there are finitely many) and record the resulting output for each input in a table. "Temperature" is applied to the final output, not inside the function.
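
A toy version of that framing (the model function is a placeholder, and the tiny vocabulary and context length are made up): enumerate every possible input, record each output, and note that temperature only touches the recorded output.

  import math
  from itertools import product

  VOCAB = ["pig", "with", "a", "mud", "dragon", "head"]   # hypothetical tiny vocabulary
  CONTEXT_LEN = 2                                         # hypothetical tiny context window

  def model_logits(context):
      # Placeholder for a real LLM forward pass: any deterministic function
      # from a context tuple to one logit per vocabulary token will do here.
      seed = sum(ord(ch) for token in context for ch in token)
      return [math.sin(seed * (i + 1)) for i in range(len(VOCAB))]

  # The "lookup table" view: enumerate every possible input and record the output.
  table = {ctx: model_logits(ctx) for ctx in product(VOCAB, repeat=CONTEXT_LEN)}

  def next_token_probs(context, temperature=1.0):
      # Temperature is applied to the stored output, not inside the function.
      scaled = [l / temperature for l in table[context]]
      z = sum(math.exp(v) for v in scaled)
      return dict(zip(VOCAB, (math.exp(v) / z for v in scaled)))

  print(next_token_probs(("pig", "with")))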

theGnuMe a day ago

You can have small epsilons instead of zeros.

ruined a day ago

what, for all possible words?

3eb7988a1663 a day ago

Instead of a naive dense matrix, you can use an implementation that allows sparsity. If an element does not exist, it gets a small non-zero default value that can still be sampled, which theoretically enables all outputs.
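
Something like this sketch (made-up counts and epsilon): only observed contexts are stored, and anything missing falls back to a small constant, so every word in the fixed vocabulary keeps a non-zero chance.

  import random

  VOCAB = ["a", "mud", "wings", "dragon", "head"]   # fixed, known vocabulary
  EPSILON = 1e-6                                    # floor for unseen continuations

  # Sparse storage: only contexts and words that were actually observed are kept.
  observed = {
      ("pig", "with"): {"mud": 7, "a": 2},
  }

  def next_word_probs(context):
      counts = observed.get(context, {})
      # Every vocabulary word gets at least EPSILON; seen words add their count.
      weights = {w: counts.get(w, 0) + EPSILON for w in VOCAB}
      total = sum(weights.values())
      return {w: v / total for w, v in weights.items()}

  def sample(context):
      probs = next_word_probs(context)
      return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

  print(sample(("pig", "with")))   # almost always "mud" or "a", very rarely "dragon"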

ruined 8 hours ago

i think at that point it's definitionally not a markov chain anymore. how do you sample an open set of unknown values?

otabdeveloper4 a day ago

Re point 1: no, "temperature" is not an inherent property of LLMs.

The big cloud providers use the "temperature" setting because having the assistant repeat to you the exact same output sequence exposes the man behind the curtain and breaks suspension of disbelief.

But if you run the LLM yourself and you want the best quality output, then turning off "temperature" entirely makes sense. That's what I do.

(The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.)
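
For example, with Hugging Face transformers run locally (a sketch assuming a small model like gpt2 is available; exact generation flags can vary by library version), "turning temperature off" roughly corresponds to greedy decoding:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")
  inputs = tok("The pig with", return_tensors="pt")

  # Greedy decoding: no sampling, always the argmax token. Deterministic,
  # but it can occasionally lock into repetitive loops.
  greedy = model.generate(**inputs, do_sample=False, max_new_tokens=20)

  # Sampled decoding: temperature > 0 trades determinism for variety.
  sampled = model.generate(**inputs, do_sample=True, temperature=0.8,
                           max_new_tokens=20)

  print(tok.decode(greedy[0], skip_special_tokens=True))
  print(tok.decode(sampled[0], skip_special_tokens=True))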

czl 16 hours ago

I mostly agree with your intuition, but I’d phrase it a bit differently.

Temperature 0 does not inherently improve “quality”. It just means you always pick the highest probability token at each step, so if you run the same prompt n times you will essentially get the same answer every time. That is great for predictability and some tasks like strict data extraction or boilerplate code, but “highest probability” is not always “best” for every task.

If you use a higher temperature and sample multiple times, you get a set of diverse answers. You can then combine them, for example by taking the most common answer, cross-checking details, or using one sample to critique another. This kind of self-ensemble can actually reduce hallucinations and boost accuracy for reasoning or open-ended questions. In that sense, somewhat counterintuitively, always using temperature 0 can lead to lower-quality results if you care about that ensemble-style robustness.
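
A rough sketch of that self-ensemble idea (ask_llm is a stand-in for whatever completion call you actually use):

  from collections import Counter

  def ask_llm(prompt, temperature):
      # Stand-in for a real completion call (local model or API of your choice).
      raise NotImplementedError

  def self_consistent_answer(prompt, n=5, temperature=0.7):
      # Sample n diverse answers at non-zero temperature, then majority-vote.
      answers = [ask_llm(prompt, temperature) for _ in range(n)]
      best, votes = Counter(answers).most_common(1)[0]
      return best, votes / n   # answer plus a crude agreement score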

One small technical nit: even with temperature 0, decoding on a GPU is not guaranteed to be bit-identical every run. Large numbers of floating-point ops in parallel can change the order of additions and multiplications, and floating-point arithmetic is not associative. Different kernel schedules or thread interleavings can give tiny numeric differences that sometimes shift an argmax choice. To make it fully deterministic you often have to disable some GPU optimizations or run on CPU only, which has a performance cost.
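
The non-associativity part is easy to see with plain Python floats; GPU kernels hit the same effect, usually at lower precision:

  a, b, c = 1e16, -1e16, 1.0
  print((a + b) + c)   # 1.0
  print(a + (b + c))   # 0.0, because c is absorbed before the cancellation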