▲ | daemonologist 8 hours ago
There are several reasons responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition (see the sketch below)
- quantization - running models at lower numeric precision (saves on both memory and compute without hurting accuracy too much)
- differences in, or the mere existence of, a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt beyond necessary formatting and things like end-of-turn tokens. They are also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance. Of course, a benchmark still can't tell you everything - real-world performance can be very different.
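A minimal sketch of the temperature point, using toy logits and NumPy only (not tied to any particular model or inference library): at temperature zero, decoding is a plain argmax and is deterministic; any positive temperature samples from a softmax over scaled logits and so varies run to run.

```python
# Minimal sketch of temperature's effect on next-token selection.
# The logits below are made up for illustration; nothing here is tied to a
# specific model or inference library.
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Pick a next-token index from raw logits.

    temperature == 0 -> greedy decoding (argmax), fully deterministic
    temperature > 0  -> softmax sampling; higher values flatten the
                        distribution and increase run-to-run variance
    """
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = [2.0, 1.5, 0.3, -1.0]  # pretend vocabulary of 4 tokens
print([sample_next_token(toy_logits, temperature=0.0) for _ in range(5)])  # always the argmax
print([sample_next_token(toy_logits, temperature=1.0) for _ in range(5)])  # varies between runs
```

Quantization and GPU nondeterminism act further upstream - they perturb the logits themselves - so even greedy decoding can diverge across different setups.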
▲ | magicalhippo 5 hours ago
AFAIK the batch your query lands in can also matter [1], though I imagine this should be a smaller effect than, say, different quantization levels.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
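The linked post covers the kernel-level details; the underlying arithmetic fact is easy to show with a self-contained sketch (synthetic data, not taken from the post): floating-point addition is not associative, so reducing the same numbers in a different order - as a differently split batch might - can give slightly different results.

```python
# Float addition is not associative, so two reduction orders over the same
# data can differ in the last bits. Numbers here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

whole   = np.sum(x)                                    # one reduction order
chunked = np.sum([np.sum(c) for c in np.array_split(x, 7)],
                 dtype=np.float32)                     # another order
print(whole, chunked, whole == chunked)  # typically not bit-identical
```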
▲ | OGEnthusiast 8 hours ago
Thanks, this is a good checklist.