daemonologist 8 hours ago

There are several reasons responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition

- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much); see the toy sketch after this list

- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat

- not-quite-deterministic GPU acceleration
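
To make the quantization point concrete, here's a toy sketch of symmetric int8 round-tripping (made-up values, plain numpy; real schemes like group-wise quantization, GPTQ, or AWQ are more involved, but the rounding error is the point):

    import numpy as np

    # Toy symmetric int8 quantization of a small weight vector (illustrative values).
    w = np.random.default_rng(1).normal(size=8).astype(np.float32)
    scale = np.abs(w).max() / 127.0              # one scale for the whole vector
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dq = w_q.astype(np.float32) * scale        # what the quantized model effectively computes with
    print(np.abs(w - w_dq).max())                # small but nonzero rounding error

The per-weight error is tiny, but it shifts logits, which can flip which token is "most likely" on borderline steps.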

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and no additions to the benchmark prompt except necessary formatting and stuff like end-of-turn tokens. They also usually are multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
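
Rough illustration of the temperature point, with made-up logits for four candidate tokens (plain numpy, not any particular inference stack):

    import numpy as np

    logits = np.array([2.0, 1.5, 0.3, -1.0])    # illustrative scores for 4 candidate tokens

    def sample_next(logits, temperature, rng):
        if temperature == 0:
            # Temperature zero = greedy decoding: always the single most likely token.
            return int(np.argmax(logits))
        scaled = logits / temperature            # higher temperature flattens the distribution
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    print([sample_next(logits, 0.0, rng) for _ in range(5)])  # always token 0
    print([sample_next(logits, 0.8, rng) for _ in range(5)])  # usually a mix of tokens

At temperature zero the output is the same every run; at 0.8 the lower-ranked tokens get sampled some of the time, so responses diverge.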

Of course a benchmark still can't tell you everything - real-world performance can be very different.

magicalhippo 5 hours ago

AFAIK the batch your query lands in can also matter[1].

Though I imagine this should be a smaller effect than, say, different quantization levels.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
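
The root cause described there is that floating-point addition isn't associative, so changing how a reduction is split up (which can depend on batch shape) changes the result slightly. A tiny numpy illustration of the same effect (not the actual kernel behavior):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000).astype(np.float32)

    a = x.sum()                                  # one summation order
    b = x.reshape(100, 1000).sum(axis=1).sum()   # partial sums first, then combine
    print(a == b, float(a) - float(b))           # typically unequal by a tiny amount

Those tiny differences propagate through the layers and can occasionally flip an argmax, which is roughly how the same prompt can produce different tokens depending on batching.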

OGEnthusiast 8 hours ago

Thanks, this is a good checklist.