> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If I have two or more logits equal to the maximum, they stay equally likely at every temperature, so in the zero-temperature limit I sample uniformly at random among them. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly between the first and the last element.
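
To make the tie case concrete, here is a small sketch in plain Python. The helper name `softmax_with_temperature` is made up for illustration, and it only handles strictly positive temperatures; the point is just to watch what the probabilities do as the temperature shrinks.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, for temperature > 0 (hypothetical helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 0.0, 1.0]
for temperature in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])

# The two tied maxima keep equal probability at every temperature,
# while the middle element's probability shrinks toward zero.
cold = softmax_with_temperature(logits, 0.01)
```

As the temperature drops, the distribution approaches [0.5, 0, 0.5] rather than a single one-hot outcome, so a sampler that follows the limit still has a coin flip to make.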

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and because floating-point addition is not associative, this has a huge impact on the logits coming out of the neural network.
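
A tiny CPU-only illustration of why the order of additions matters, assuming ordinary IEEE 754 doubles. The `accumulate` helper is hypothetical; it uses an explicit loop (rather than the built-in `sum`, which may use a more accurate algorithm) so the additions happen exactly in the order written.

```python
def accumulate(values):
    """Add values strictly left to right (hypothetical helper)."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two different orders of addition.
order_a = accumulate([1e16, 1.0, -1e16])  # the 1.0 is rounded away before the cancellation
order_b = accumulate([1e16, -1e16, 1.0])  # the cancellation happens first, so the 1.0 survives

print(order_a)  # 0.0
print(order_b)  # 1.0

# The classic small case: grouping alone changes the rounded result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

A matrix multiplication is a large number of such sums, so a kernel that is free to pick the reduction order can return slightly different logits for the same inputs.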

EvgeniyZh 5 hours ago

You don't have to sample uniformly. You could take the lowest index among all the maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing about that.
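
A minimal sketch of that tie-breaking rule in plain Python; `argmax_lowest_index` is an illustrative name, not taken from any particular library.

```python
def argmax_lowest_index(logits):
    """Index of the maximum; ties go to the lowest index (hypothetical helper)."""
    best = 0
    for i, value in enumerate(logits):
        if value > logits[best]:  # strict '>' keeps the earliest maximum
            best = i
    return best

print(argmax_lowest_index([1.0, 0.0, 1.0]))  # 0
```

With this rule, the [1, 0, 1] case always yields index 0, so the sampling step itself is deterministic as long as the logits are bit-identical from run to run.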

jstanley 3 hours ago

> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.

But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.

vbarrielle 2 hours ago

It may be an implementation detail, but in practice, if the only way to get a deterministic output is to run on the CPU, then it's not going to be usable.

	Dylan16807 8 minutes ago
		You can tell GPUs what order to do math instructions in.
	317070 an hour ago
		Actually, Google's TPUs are also deterministic!

DougBTX 3 hours ago

> GPUs put the associativity of the sums in matrix multiplications in arbitrary order

That’s user-controlled too, not an inherent property of GPUs:

https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
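
For anyone who wants to try it, a rough sketch of opting in through PyTorch's global switch `torch.use_deterministic_algorithms`. The tensor shapes and the seed are arbitrary choices for illustration, and the `CUBLAS_WORKSPACE_CONFIG` value follows the PyTorch reproducibility notes for CUDA 10.2 and later.

```python
import os

# cuBLAS wants this set before CUDA is initialized when deterministic
# algorithms are requested; it has no effect on CPU-only machines.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)  # nondeterministic ops listed in the docs now raise

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)

first = a @ b
second = a @ b
print(torch.equal(first, second))  # bitwise comparison of the two products
```

Two matching products inside one process are only a sanity check, not evidence of reproducibility across runs, devices, or library versions.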

	vbarrielle 2 hours ago
		Under these settings, the docs only list matrix multiplication as deterministic for sparse-dense products:
		> torch.bmm() when called on sparse-dense CUDA tensors
		It's also not listed under the operations that raise an exception otherwise, so I'm not sure the docs promise that dense-dense matrix-matrix products are deterministic.