botacode · 2 hours ago

Load just makes LLMs behave less deterministically and likely degrades their output. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in... They don't have to be malicious operators in this case. It just happens.

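To make the mechanism concrete, here is a minimal PyTorch sketch of the batch-invariance problem the linked post describes (sizes are illustrative, not taken from the post): the same input row can yield bit-different logits depending on which batch it happens to be grouped with.

    import torch

    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"

    W = torch.randn(2048, 2048, device=device)
    x = torch.randn(1, 2048, device=device)
    filler = torch.randn(2047, 2048, device=device)

    alone = x @ W                          # the row computed by itself
    in_batch = torch.cat([x, filler]) @ W  # the same row inside a batch of 2048

    # Frequently False on GPUs (sometimes on CPU too): the kernel may pick a
    # different reduction strategy for the larger batch, and floating-point
    # addition is not associative, so the "same" logits differ in the last bits.
    print(torch.equal(alone[0], in_batch[0]))

Once the logits differ in their last bits, a near-tie between two candidate tokens can resolve differently, and the visible output changes even though no code path changed.
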
bgirard · 2 hours ago

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits, it matters to me. I care about *expected* performance when picking which model to use, not optimal benchmark performance.

strongpigeon · 17 minutes ago

The question I have after reading this paper (which was really insightful): do the models actually get worse under load, or do they just show higher variance? The latter seems like what we should expect, rather than outright degradation, but absent load data we can't really know.

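With per-request scores collected off-peak and under load, the two explanations would separate cleanly: a mean shift means the model got worse, a variance change alone means it just got noisier. A hypothetical sketch with simulated numbers (no such dataset exists in this thread):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    off_peak = rng.normal(loc=0.72, scale=0.05, size=500)    # simulated per-request scores
    under_load = rng.normal(loc=0.72, scale=0.09, size=500)  # same mean, wider spread

    # Welch's t-test for a shift in the mean; Levene's test for a change in spread.
    _, p_mean = stats.ttest_ind(off_peak, under_load, equal_var=False)
    _, p_var = stats.levene(off_peak, under_load)
    print(f"mean shift p={p_mean:.3f}, variance change p={p_var:.3f}")
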
altcognito · 2 hours ago

Explain this though. The code is deterministic, even if it relies on pseudorandom number generation. It doesn't just happen; someone has to make a conscious decision to force a different code path (or a different model) when the system is loaded.

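Both halves of this are easy to see in a toy sketch (illustrative only, not anyone's actual serving code): the sampling step really is a pure function of the seed and the logits, but the logits themselves can shift by a few last bits when upstream arithmetic changes, and that alone can flip a near-tied token without any change of code path.

    import torch

    logits = torch.tensor([2.0, 1.9, 0.5, -1.0])

    def sample(seed):
        # The sampling code is fully deterministic given (logits, seed).
        g = torch.Generator().manual_seed(seed)
        return int(torch.multinomial(torch.softmax(logits, dim=-1), 1, generator=g))

    print(sample(42) == sample(42))  # True: same seed, same logits, same token

    # ...but a last-bits-sized change to the logits flips greedy decoding near a tie.
    close = torch.tensor([1.0000001, 1.0, 0.5])
    nudged = close + torch.tensor([0.0, 2e-7, 0.0])
    print(int(torch.argmax(close)), int(torch.argmax(nudged)))  # 0, then 1
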
stefan_ · 16 minutes ago

The primary (non-malicious, non-stupid) explanation given here is batching. But if you look at large-scale inference, I think you would find that the batch sizes run on any given rig are fairly static: for any given model part run on its own there is a sweet spot between memory consumption and GPU utilization, and GPUs generally do badly at job parallelism. I think the more likely explanation is, again, the extremely heterogeneous compute platforms they run on.

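Heterogeneous hardware alone is enough to produce diverging results even at a fixed batch size. A toy illustration (mine, not the commenter's): different GPUs and kernels tile the same reduction into different chunk sizes, and floating-point addition is not associative.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(1 << 20).astype(np.float32)
    b = rng.standard_normal(1 << 20).astype(np.float32)

    def chunked_dot(x, y, chunk):
        # Reduce in fixed-size chunks, then sum the partials -- a stand-in for
        # how different kernels or hardware tile the same reduction.
        partials = [np.dot(x[i:i + chunk], y[i:i + chunk]) for i in range(0, len(x), chunk)]
        return float(sum(partials))

    # The same mathematical dot product, tiled two ways; the float32 results
    # typically differ in the last digits.
    print(chunked_dot(a, b, 1024), chunked_dot(a, b, 65536))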