OGEnthusiast 9 hours ago

I'm not even sure how to evaluate what a "better" LLM is, when I've tried running the exact same model (Qwen3) and prompt and gotten vastly different responses on Qwen Chat vs OpenRouter vs running the model locally.

daemonologist 8 hours ago | parent | next [-]

There are several reasons why responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition

- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)

- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat

- not-quite-deterministic GPU acceleration
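The quantization point can be sketched with a toy symmetric int8 round-trip (pure Python, illustrative only — real inference stacks use per-channel scales and packed formats, but the rounding loss is the same idea):

```python
# Toy symmetric int8 quantization: scale weights into [-127, 127], round,
# then de-quantize. The rounding step loses a little precision, which is
# one source of output drift between quantized and full-precision runs.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.1234, -0.5678, 0.9012, -0.0003]  # made-up weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The restored weights are close to, but not exactly, the originals.
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(max(errors))  # small but nonzero
```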

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt beyond necessary formatting and stuff like end-of-turn tokens. They are also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
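The temperature effect can be sketched in a few lines (the logits are made-up numbers; real decoders work the same way over a full vocabulary):

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    """Sample a token index from logits at the given temperature.
    temperature == 0 means greedy decoding (argmax), as used in benchmarks."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical next-token scores

rng = random.Random(0)
greedy = [sample_next_token(logits, 0, rng) for _ in range(5)]
sampled = [sample_next_token(logits, 1.0, rng) for _ in range(5)]
print(greedy)   # always the most likely token
print(sampled)  # varies draw to draw
```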

Of course a benchmark still can't tell you everything - real-world performance can be very different.

magicalhippo 5 hours ago | parent | next [-]

AFAIK the batch your query lands in can also matter[1].

Though I imagine this is a smaller effect than, say, different quantization levels.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
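The root cause described in that post is that floating-point addition is not associative, and changing batch composition can change the order in which a GPU kernel reduces sums. A minimal illustration:

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order (as happens when batch size changes the reduction pattern)
# can give slightly different results.

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)    # the two sums differ in the last bits
```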

OGEnthusiast 8 hours ago | parent | prev [-]

Thanks, this is a good checklist.

1899-12-30 9 hours ago | parent | prev | next [-]

That's a difference in the system prompt, not the model itself.

OGEnthusiast 7 hours ago | parent [-]

True yeah, good point.

jabroni_salad 7 hours ago | parent | prev [-]

I can't speak to qwen, but something interesting with Deepseek is that the official API supports almost no parameters, while the vllm hosts on openrouter do. The experience you get with the rehosters is wildly different since you can use samplers.
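For readers unfamiliar with "samplers": these are decoding-time filters like top-p, top-k, or min-p that rehosters typically expose as request parameters. A minimal sketch of nucleus (top-p) sampling over a made-up token distribution:

```python
import random

def top_p_sample(probs, p, rng):
    """Nucleus (top-p) sampling: keep the smallest set of most-likely tokens
    whose cumulative probability reaches p, then sample from that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
rng = random.Random(1)
draws = [top_p_sample(probs, 0.8, rng) for _ in range(100)]
print(sorted(set(draws)))  # tokens 2 and 3 never survive the p=0.8 cutoff
```

With p=0.8 only the top two tokens (0.5 + 0.3) make the cut, so the low-probability tail is never sampled — which is the kind of behavior you can't control if the hosted API ignores the parameter.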