cush | 14 hours ago
Is setting temperature to 0 even a valid way to measure LLM performance over time, all else equal?
criemen | 14 hours ago
Even with temperature 0, LLM output will not be deterministic; it will just have less randomness (not precisely defined) than with temperature 1. There was a recent post on the front page about fully deterministic inference, but it turns out to be quite difficult: floating-point addition isn't associative, and the reduction order inside GPU kernels can change with batch size and server load, so the logits themselves wobble from run to run.
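For intuition, here's a minimal sketch of what the temperature knob does to next-token sampling (toy logits, numpy assumed; temperature 0 is conventionally treated as argmax):

    import numpy as np

    def sample_next_token(logits, temperature, rng):
        # Temperature 0 is conventionally greedy decoding: pick the argmax.
        if temperature == 0.0:
            return int(np.argmax(logits))
        scaled = logits / temperature
        # Softmax with max-subtraction for numerical stability.
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.9, 0.5])  # toy next-token logits
    rng = np.random.default_rng(0)
    print([sample_next_token(logits, 1.0, rng) for _ in range(5)])  # varies
    print([sample_next_token(logits, 0.0, rng) for _ in range(5)])  # always 0

The catch is that argmax is only deterministic if the logits are: when two logits are as close as the first two above, run-to-run floating-point noise can flip which one wins.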
jonplackett | 14 hours ago
It could be that performance at temp zero has declined while performance at a normal temp is the same or better. I also wonder whether temp zero would be more influenced by changes to the system prompt; I can imagine it making responses more brittle. (A rough way to test the first part is sketched below.)
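A rough sketch of that comparison, with a hypothetical run_task() standing in for whatever your eval harness actually does to call the model and grade the result:

    import statistics

    def run_task(prompt: str, temperature: float) -> bool:
        # Hypothetical stand-in: call the model at the given temperature
        # and return whether the task passed. Not a real API.
        raise NotImplementedError

    def pass_rate(prompts, temperature, trials=10):
        # Repeat each task a few times so sampling noise averages out.
        results = [run_task(p, temperature)
                   for p in prompts for _ in range(trials)]
        return statistics.mean(results)

    # If pass_rate(prompts, 0.0) drifts down over time while
    # pass_rate(prompts, 0.7) holds steady, the regression is specific
    # to greedy decoding rather than to the model overall.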
Spivak | 12 hours ago
I don't think it's a valid measure across models, but, as in the OP, it's a great measure for detecting when they mess with "the same model" behind the scenes. That said, we also keep a test suite to check that model updates don't produce worse results for our users, and it has worked well enough. We had to skip a few versions of Sonnet because it stopped being able to complete tasks (on the same data) that it could previously. I don't blame Anthropic; it would be crazy to assume that new models are a strict improvement across all tasks and domains. I do wish they would stop deprecating old models, though; once you have something working to your satisfaction, it would be nice to freeze it. Ah well, that's only possible with local models.
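For reference, the skeleton of such a check looks roughly like this (Anthropic Python SDK; the model ids and task list here are placeholders, not a real suite):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    PINNED_MODEL = "claude-sonnet-4-20250514"  # pin a dated snapshot, not an alias

    TASKS = [  # (prompt, expected substring) -- stand-ins for real eval cases
        ('Return the JSON {"ok": true} and nothing else.', '"ok": true'),
    ]

    def pass_fraction(model: str) -> float:
        hits = 0
        for prompt, expected in TASKS:
            resp = client.messages.create(
                model=model,
                max_tokens=256,
                temperature=0,  # reduces (does not eliminate) run-to-run variance
                messages=[{"role": "user", "content": prompt}],
            )
            hits += expected in resp.content[0].text
        return hits / len(TASKS)

    # Gate the upgrade: only move off the pinned snapshot if the candidate
    # does at least as well on the same data.
    if pass_fraction("claude-sonnet-4-5") >= pass_fraction(PINNED_MODEL):
        print("candidate ok to roll out")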
fortyseven | 14 hours ago
I'd have assumed a fixed seed was used, but he doesn't mention that. Weird. Maybe he meant that?
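For what it's worth, some hosted APIs do expose one. The OpenAI chat API, for example, takes a best-effort seed parameter and returns a system_fingerprint so you can tell when the backend changed out from under you (model name below is a placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the env

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        seed=12345,           # fixed seed: best-effort reproducibility only
        temperature=0,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.system_fingerprint)  # backend version; if this changes, outputs may too
    print(resp.choices[0].message.content)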