| ▲ | hleszek 2 hours ago |
| Why not allow the user to provide the seed used for generation? That way you could at least detect whether the model has changed: if the same prompt with the same seed suddenly gives a new answer (assuming they don't cache answers), something has changed. You could also compare different providers that supposedly serve the same model, and if the model is open-weight you could even compare against runs on your own hardware or on rented GPUs. |
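| Roughly what I have in mind, as a minimal sketch against OpenAI-compatible endpoints that accept a `seed` parameter (the base URLs, keys, and model id below are placeholders, and not every provider honors the seed): |

```python
# Sketch: ask two providers that claim to serve the same open-weight model
# for a completion with a fixed seed and temperature=0, then compare outputs.
# Endpoints, keys, and model id are placeholders, not real values.
from openai import OpenAI

PROMPT = "List the first five prime numbers."
SEED = 1234

providers = {
    "provider_a": OpenAI(base_url="https://provider-a.example/v1", api_key="KEY_A"),
    "provider_b": OpenAI(base_url="https://provider-b.example/v1", api_key="KEY_B"),
}

outputs = {}
for name, client in providers.items():
    resp = client.chat.completions.create(
        model="some-open-weight-model",  # placeholder model id
        messages=[{"role": "user", "content": PROMPT}],
        seed=SEED,        # only has an effect if the provider honors it
        temperature=0,
        max_tokens=64,
    )
    outputs[name] = resp.choices[0].message.content

print("identical" if len(set(outputs.values())) == 1 else "diverged")
```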
|
| ▲ | bthornbury 2 hours ago | parent | next [-] |
AFAIK seed determinism can't really be relied upon between two machines, maybe not even between two different GPUs.
| |
| ▲ | whatsupdog an hour ago | parent | next [-] |
| That doesn't seem correct. It's just matrix multiplication in the end. It doesn't matter whether it's a different computer, a different GPU, or even math on a napkin: the same seed, input, and weights should give the same output. Please correct me if I'm wrong. |
|
| ▲ | jashulma an hour ago | parent | next [-] |
| https://thinkingmachines.ai/blog/defeating-nondeterminism-in... |
| A nice write-up explaining how it's not as simple as it sounds. |
|
| ▲ | tripplyons an hour ago | parent | prev | next [-] |
| There are many ways to compute the same matrix multiplication that apply the sum reduction in different orders, and those can produce different answers with floating-point values, because floating-point addition is not truly associative due to rounding. |
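| A minimal illustration (values chosen to make the rounding visible): |

```python
# Floating-point addition is not associative, so reduction order matters.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds back to -1e16, so the result is 0.0

print(left, right, left == right)  # 1.0 0.0 False
```
|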
| ▲ | spwa4 25 minutes ago | parent [-] |
| Is that really going to matter in FP32, FP16, or BF16? I would think models would be written so they'd be at least somewhat numerically stable. Also, if the inference provider guarantees specific hardware, this shouldn't happen. |
| |
| ▲ | measurablefunc an hour ago | parent | prev [-] |
| You're assuming consistent hardware and software profiles. At scale this is essentially a compiler/instruction-scheduling problem, where different CPU/GPU combinations are the pipelines of what is basically a data-center-scale computer. The function graph is broken into parts, compiled for different hardware profiles with different kernels, and then deployed and stitched together to maximize hardware utilization while minimizing cost. Providers don't do this because they want to, but because they want to be profitable: every hardware cycle not spent on querying or optimization is wasted money. You'll never get major companies to agree to your proposal, because it would mean providing a real SLA for all of their customers, and they'll never agree to that. |
| |
| ▲ | maxilevi an hour ago | parent | prev [-] |
| That's not true in practice. |
|
| ▲ | tripplyons 43 minutes ago | parent [-] |
| It is definitely true across different chips. The best kernel to use varies with the chip it runs on, which often means the underlying operations are executed in a different order. For example, adding the same floating-point values in a different order can return a different result, because floating-point addition is not associative due to rounding. |
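| A toy sketch of the effect (synthetic values, not a real kernel): a plain sequential sum vs. a pairwise/tree reduction, which is roughly how a parallel kernel accumulates: |

```python
# Summing the same values in two different orders usually disagrees in the
# last bits because each order rounds differently.
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8) for _ in range(10_000)]

def sequential_sum(vals):
    # Left-to-right accumulation, like a single-threaded loop.
    total = 0.0
    for v in vals:
        total += v
    return total

def pairwise_sum(vals):
    # Tree reduction, like a parallel kernel combining partial sums.
    if len(vals) == 1:
        return vals[0]
    mid = len(vals) // 2
    return pairwise_sum(vals[:mid]) + pairwise_sum(vals[mid:])

print(sequential_sum(xs))
print(pairwise_sum(xs))
print(sequential_sum(xs) == pairwise_sum(xs))  # typically False
```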
|
|
|
| ▲ | bthornbury 2 hours ago | parent | prev [-] |
| Something like a perplexity/log-likelihood measurement across a large enough number of prompts/tokens might get you the same thing in a statistical sense, though. I expect the comparison percentages at the top are something like that. |
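| For example, with an open-weight model loaded locally via Hugging Face transformers (the model id below is a placeholder; the same idea works against any API that returns per-token logprobs): |

```python
# Sketch: average negative log-likelihood (and perplexity) of fixed reference
# texts under a locally loaded model. Re-running this over time, or against a
# provider, would show a silent model swap as a shift in the numbers.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-open-weight-model"  # placeholder model id
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Water boils at 100 degrees Celsius at sea level.",
]

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        # labels=ids makes the model return mean cross-entropy over the
        # predicted (shifted) tokens.
        loss = model(ids, labels=ids).loss
        n = ids.shape[1] - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("mean NLL per token:", total_nll / total_tokens)
print("perplexity:", math.exp(total_nll / total_tokens))
```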