otabdeveloper4 | a day ago
Re point 1: no, "temperature" is not an inherent property of LLM's. The big cloud providers use the "temperature" setting because having the assistant repeat to you the exact same output sequence exposes the man behind the curtain and breaks suspension of disbelief. But if you run the LLM yourself and you want the best quality output, then turning off "temperature" entirely makes sense. That's what I do. (The downside is that the LLM can then, rarely, get stuck in infinite loops. Again, this isn't a big deal unless you really want to persist with the delusion that an LLM is a human-like assistant.) | ||
czl | 16 hours ago | parent
I mostly agree with your intuition, but I'd phrase it a bit differently. Temperature 0 does not inherently improve "quality". It just means you always pick the highest-probability token at each step, so if you run the same prompt n times you will get essentially the same answer every time. That is great for predictability and for tasks like strict data extraction or boilerplate code, but "highest probability" is not always "best" for every task.

If you use a higher temperature and sample multiple times, you get a set of diverse answers. You can then combine them, for example by taking the most common answer, cross-checking details, or using one sample to critique another. This kind of self-ensemble can actually reduce hallucinations and boost accuracy on reasoning or open-ended questions. In that sense, somewhat counterintuitively, always using temperature 0 can lead to lower-quality results if you care about that ensemble-style robustness.

One small technical nit: even at temperature 0, decoding on a GPU is not guaranteed to be bit-identical across runs. Large numbers of floating-point ops executed in parallel can change the order of additions and multiplications, and floating-point arithmetic is not associative. Different kernel schedules or thread interleavings can give tiny numeric differences that occasionally shift an argmax choice. To make it fully deterministic you often have to disable some GPU optimizations or run on CPU only, which has a performance cost.
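
To make the first point concrete, here is a minimal sketch in plain NumPy (with made-up logits) of what the temperature knob does at a single decoding step: logits are divided by T before the softmax, so T = 0 collapses to the argmax token while higher T flattens the distribution and lets sampling pick lower-probability tokens.

    import numpy as np

    def sample_token(logits, temperature, rng):
        if temperature == 0.0:
            # Greedy decoding: always the single highest-probability token.
            return int(np.argmax(logits))
        scaled = logits / temperature
        scaled -= scaled.max()                      # numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.9, 0.5, -1.0])        # hypothetical next-token scores
    print(sample_token(logits, 0.0, rng))           # always token 0
    print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies run to run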
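And a rough sketch of the self-ensemble idea: sample the same prompt several times at a nonzero temperature and take a majority vote over the final answers. Here ask_model is a hypothetical function standing in for whatever API or local runtime you use to get one sampled completion.

    from collections import Counter

    def self_consistent_answer(prompt, ask_model, n_samples=5, temperature=0.8):
        # Sample n independent answers, then keep the most common one.
        answers = [ask_model(prompt, temperature=temperature) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]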
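The non-associativity point is easy to see even without a GPU; a different summation order alone can change the result, and a tiny difference like this is enough to flip an argmax between two near-tied logits:

    print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))   # 0.6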