cush 14 hours ago

Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?

criemen 14 hours ago | parent | next [-]

Even with temperature 0, the LLM output will not be deterministic; it will just have less randomness than with temperature 1 ("less" not being precisely defined here). There was a recent post on the frontpage about fully deterministic sampling, but it turns out to be quite difficult.
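For context, a toy sketch of what temperature does mathematically (my own illustration, not any particular vendor's sampler): logits are divided by T before the softmax, so low T sharpens the distribution toward the argmax, and T=0 is typically special-cased as pure greedy decoding.

```python
import numpy as np

def sample_probs(logits, T):
    # Temperature scaling: divide logits by T, then softmax.
    # As T -> 0 this approaches a one-hot distribution on the argmax;
    # T = 0 itself is usually handled as a special case (greedy pick).
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
print(sample_probs(logits, 1.0))   # spread-out distribution
print(sample_probs(logits, 0.1))   # nearly one-hot at the argmax
```

The remaining nondeterminism at temp 0 therefore comes from the numerics producing the logits, not from the sampling step itself.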

visarga 13 hours ago | parent [-]

It's because the serving batch size is dynamic. A different batch size changes the order of the floating-point reductions inside the kernels, which can change the logits, and so the output, even at temp 0.
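A minimal illustration of the underlying mechanism (a toy example, not the actual serving stack): floating-point addition is not associative, so a kernel that reduces the same values in a different grouping can produce slightly different logits, which is enough to flip a greedy temp-0 argmax when two tokens are nearly tied.

```python
# Float addition is not associative: regrouping the same three
# numbers gives two different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)
```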

jonplackett 14 hours ago | parent | prev | next [-]

It could be that performance on temp zero has declined but performance on a normal temp is the same or better.

I wonder if temp zero would be more influenced by changes to the system prompt too. I can imagine it making responses more brittle.

Spivak 12 hours ago | parent | prev | next [-]

I don't think it's a valid measure across models but, as in the OP, it's a great measure for when they mess with "the same model" behind the scenes.

That being said, we also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because they stopped being able to complete tasks (on the same data) that earlier versions could. I don't blame Anthropic; I would be crazy to assume that new models are a strict improvement across all tasks and domains.

I do just wish they would stop deprecating old models; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.

fortyseven 14 hours ago | parent | prev [-]

I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?

numpad0 12 hours ago | parent | next [-]

Pure sci-fi idea: what if actually nothing was changed, but RNGs were becoming less random as we extract more randomness out of the universe?

maxbond 8 hours ago | parent | prev [-]

I bet they did both. If I'm reading the documentation right you have to supply a seed in order to get "best effort" determinism.

https://learn.microsoft.com/en-us/azure/ai-foundry/openai/re...
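For reference, OpenAI-compatible chat APIs expose a `seed` request parameter for this best-effort determinism; `temperature: 0` alone is not a guarantee. A hedged sketch of the request shape, shown as the JSON payload rather than a live call (model name is illustrative):

```python
import json

# Request payload for an OpenAI-compatible /chat/completions endpoint.
# `seed` requests best-effort reproducibility across calls; determinism
# is only expected while the response's `system_fingerprint` matches.
payload = {
    "model": "gpt-4o",  # illustrative model name
    "messages": [{"role": "user", "content": "Say hi"}],
    "temperature": 0,
    "seed": 42,  # best-effort determinism, per the docs above
}
print(json.dumps(payload, indent=2))
```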