daemonologist 8 hours ago

There are several reasons responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition

- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much); see the toy sketch after this list

- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat

- not-quite-deterministic GPU acceleration
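
To make the quantization point concrete, here's a toy sketch of symmetric int8 round-tripping (made-up values, plain numpy; real schemes like group-wise quantization, GPTQ, or AWQ are more involved, but the rounding error is the point):

    import numpy as np

    # Toy symmetric int8 quantization of a small weight vector (illustrative values).
    w = np.random.default_rng(1).normal(size=8).astype(np.float32)
    scale = np.abs(w).max() / 127.0              # one scale for the whole vector
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dq = w_q.astype(np.float32) * scale        # what the quantized model effectively computes with
    print(np.abs(w - w_dq).max())                # small but nonzero rounding error

The per-weight error is tiny, but it shifts logits, which can flip which token is "most likely" on borderline steps.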

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and no additions to the benchmark prompt except necessary formatting and stuff like end-of-turn tokens. They also usually are multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
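
Rough illustration of the temperature point, with made-up logits for four candidate tokens (plain numpy, not any particular inference stack):

    import numpy as np

    logits = np.array([2.0, 1.5, 0.3, -1.0])    # illustrative scores for 4 candidate tokens

    def sample_next(logits, temperature, rng):
        if temperature == 0:
            # Temperature zero = greedy decoding: always the single most likely token.
            return int(np.argmax(logits))
        scaled = logits / temperature            # higher temperature flattens the distribution
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    print([sample_next(logits, 0.0, rng) for _ in range(5)])  # always token 0
    print([sample_next(logits, 0.8, rng) for _ in range(5)])  # usually a mix of tokens

At temperature zero the output is the same every run; at 0.8 the lower-ranked tokens get sampled some of the time, so responses diverge.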

Of course a benchmark still can't tell you everything - real-world performance can be very different.

magicalhippo 5 hours ago

AFAIK the batch your query lands in can also matter[1].

Though I imagine this should be a smaller effect than, say, different quantization levels.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
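
The root cause described there is that floating-point addition isn't associative, so changing how a reduction is split up (which can depend on batch shape) changes the result slightly. A tiny numpy illustration of the same effect (not the actual kernel behavior):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000).astype(np.float32)

    a = x.sum()                                  # one summation order
    b = x.reshape(100, 1000).sum(axis=1).sum()   # partial sums first, then combine
    print(a == b, float(a) - float(b))           # typically unequal by a tiny amount

Those tiny differences propagate through the layers and can occasionally flip an argmax, which is roughly how the same prompt can produce different tokens depending on batching.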

OGEnthusiast 8 hours ago

Thanks, this is a good checklist.