botacode · 2 hours ago

Load just makes LLMs behave less deterministically and likely degrades their output. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in... They don't have to be malicious operators in this case. It just happens.

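To make the mechanism concrete, here is a minimal PyTorch sketch of the batch-invariance problem the linked post describes (sizes are illustrative, not taken from the post): the same input row can yield bit-different logits depending on which batch it happens to be grouped with.

    import torch

    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    W = torch.randn(2048, 2048, device=device)
    x = torch.randn(1, 2048, device=device)
    filler = torch.randn(2047, 2048, device=device)

    alone = x @ W                          # the row computed by itself
    in_batch = torch.cat([x, filler]) @ W  # the same row inside a batch of 2048

    # Frequently False on GPUs (sometimes on CPU too): the kernel may pick a
    # different reduction strategy for the larger batch, and floating-point
    # addition is not associative, so the "same" logits differ in the last bits.
    print(torch.equal(alone[0], in_batch[0]))

Once the logits differ in their last bits, a near-tie between two candidate tokens can resolve differently, and the visible output changes even though no code path changed.
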
bgirard · 2 hours ago

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits, it matters to me. I care about *expected* performance when picking which model to use, not optimal benchmark performance.

strongpigeon · 17 minutes ago

The question I have after reading this paper (which was really insightful): do the models actually get worse under load, or do they just show higher variance? The latter seems like what we should expect, rather than outright degradation, but absent load data we can't really know.

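With per-request scores collected off-peak and under load, the two explanations would separate cleanly: a mean shift means the model got worse, a variance change alone means it just got noisier. A hypothetical sketch with simulated numbers (no such dataset exists in this thread):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    off_peak = rng.normal(loc=0.72, scale=0.05, size=500)    # simulated per-request scores
    under_load = rng.normal(loc=0.72, scale=0.09, size=500)  # same mean, wider spread

    # Welch's t-test for a shift in the mean; Levene's test for a change in spread.
    _, p_mean = stats.ttest_ind(off_peak, under_load, equal_var=False)
    _, p_var = stats.levene(off_peak, under_load)
    print(f"mean shift p={p_mean:.3f}, variance change p={p_var:.3f}")
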
altcognito · 2 hours ago

Explain this though. The code is deterministic, even if it relies on pseudorandom number generation. It doesn't just happen; someone has to make a conscious decision to force a different code path (or a different model) when the system is loaded.

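Both halves of this are easy to see in a toy sketch (illustrative only, not anyone's actual serving code): the sampling step really is a pure function of the seed and the logits, but the logits themselves can shift by a few last bits when upstream arithmetic changes, and that alone can flip a near-tied token without any change of code path.

    import torch

    logits = torch.tensor([2.0, 1.9, 0.5, -1.0])

    def sample(seed):
        # The sampling code is fully deterministic given (logits, seed).
        g = torch.Generator().manual_seed(seed)
        return int(torch.multinomial(torch.softmax(logits, dim=-1), 1, generator=g))

    print(sample(42) == sample(42))  # True: same seed, same logits, same token

    # ...but a last-bits-sized change to the logits flips greedy decoding near a tie.
    close = torch.tensor([1.0000001, 1.0, 0.5])
    nudged = close + torch.tensor([0.0, 2e-7, 0.0])
    print(int(torch.argmax(close)), int(torch.argmax(nudged)))  # 0, then 1
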
stefan_ · 16 minutes ago

The primary (non-malicious, non-stupid) explanation given here is batching. But if you look at large-scale inference, I think you would find that the batch sizes run on any given rig are fairly static: for any given model part run on its own there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism. I think the more likely explanation is, again, the extremely heterogeneous compute platforms they run on.

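Heterogeneous hardware alone is enough to produce diverging results even at a fixed batch size. A toy illustration (mine, not the commenter's): different GPUs and kernels tile the same reduction into different chunk sizes, and floating-point addition is not associative.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(1 << 20).astype(np.float32)
    b = rng.standard_normal(1 << 20).astype(np.float32)

    def chunked_dot(x, y, chunk):
        # Reduce in fixed-size chunks, then sum the partials -- a stand-in for
        # how different kernels or hardware tile the same reduction.
        partials = [np.dot(x[i:i + chunk], y[i:i + chunk]) for i in range(0, len(x), chunk)]
        return float(sum(partials))

    # The same mathematical dot product, tiled two ways; the float32 results
    # typically differ in the last digits.
    print(chunked_dot(a, b, 1024), chunked_dot(a, b, 65536))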