cush | 14 hours ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
criemen | 14 hours ago
Even with temperature 0, LLM output will not be deterministic; it will just have less randomness (not precisely defined) than with temperature 1. There was a recent post on the front page about fully deterministic inference, but it turns out to be quite difficult: floating-point addition isn't associative, and the reduction order inside GPU kernels can change with batch size and server load, so the logits themselves wobble from run to run.
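For intuition, here's a minimal sketch of what the temperature knob does to next-token sampling (toy logits, numpy assumed; temperature 0 is conventionally treated as argmax):

    import numpy as np

    def sample_next_token(logits, temperature, rng):
        # Temperature 0 is conventionally greedy decoding: pick the argmax.
        if temperature == 0.0:
            return int(np.argmax(logits))
        scaled = logits / temperature
        # Softmax with max-subtraction for numerical stability.
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.9, 0.5])  # toy next-token logits
    rng = np.random.default_rng(0)
    print([sample_next_token(logits, 1.0, rng) for _ in range(5)])  # varies
    print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # always 0

The catch is that argmax is only deterministic if the logits are: when two logits are as close as the first two above, run-to-run floating-point noise can flip which one wins.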
jonplackett | 14 hours ago
It could be that performance at temp zero has declined while performance at a normal temp is the same or better. I also wonder whether temp zero would be more influenced by changes to the system prompt; I can imagine it making responses more brittle. (A rough way to test the first part is sketched below.)
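A rough sketch of that comparison, with a hypothetical run_task() standing in for whatever your eval harness actually does to call the model and grade the result:

    import statistics

    def run_task(prompt: str, temperature: float) -> bool:
        # Hypothetical stand-in: call the model at the given temperature
        # and return whether the task passed. Not a real API.
        raise NotImplementedError

    def pass_rate(prompts, temperature, trials=10):
        # Repeat each task a few times so sampling noise averages out.
        results = [run_task(p, temperature)
                   for p in prompts for _ in range(trials)]
        return statistics.mean(results)

    # If pass_rate(prompts, 0.0) drifts down over time while
    # pass_rate(prompts, 0.7) holds steady, the regression is specific
    # to greedy decoding rather than to the model overall.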
Spivak | 12 hours ago
I don't think it's a valid measure across models, but, as in the OP, it's a great measure for detecting when they mess with "the same model" behind the scenes. That said, we also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because it stopped being able to complete tasks (on the same data) that it could previously. I don't blame Anthropic; it would be crazy to assume that new models are a strict improvement across all tasks and domains. I do wish they would stop deprecating old models, though; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.
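For reference, the skeleton of such a check looks roughly like this (Anthropic Python SDK; the model ids and task list here are placeholders, not a real suite):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    PINNED_MODEL = "claude-sonnet-4-20250514"  # pin a dated snapshot, not an alias

    TASKS = [  # (prompt, expected substring) -- stand-ins for real eval cases
        ('Return the JSON {"ok": true} and nothing else.', '"ok": true'),
    ]

    def pass_fraction(model: str) -> float:
        hits = 0
        for prompt, expected in TASKS:
            resp = client.messages.create(
                model=model,
                max_tokens=256,
                temperature=0,  # reduces (does not eliminate) run-to-run variance
                messages=[{"role": "user", "content": prompt}],
            )
            hits += expected in resp.content[0].text
        return hits / len(TASKS)

    # Gate the upgrade: only move off the pinned snapshot if the candidate
    # does at least as well on the same data.
    if pass_fraction("claude-sonnet-4-5") >= pass_fraction(PINNED_MODEL):
        print("candidate ok to roll out")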
fortyseven | 14 hours ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
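For what it's worth, some hosted APIs do expose one. The OpenAI chat API, for example, takes a best-effort seed parameter and returns a system_fingerprint so you can tell when the backend changed out from under you (model name below is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the env

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        seed=12345,           # fixed seed: best-effort reproducibility only
        temperature=0,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.system_fingerprint)  # backend version; if this changes, outputs may too
    print(resp.choices[0].message.content)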