postalcoder · 2 days ago
One of my favorite personal evals for LLMs is testing their stability as a reviewer. The basic gist: give the LLM some code to review and have it assign a grade, multiple times. How much variance is there in the grade? Then prompt the same LLM to be a "critical" reviewer of the same code, again multiple times. How much does the average critical grade change? Low variance across many generations, plus a small delta between "review this code" and "review this code with a critical eye," is a major positive signal for quality.

I've found that gpt-5.1 produces remarkably stable evaluations, whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical, whereas gpt-5.1 stays directionally the same while tightening the screws. You could also interpret these results as a proxy for obsequiousness.

Edit: one major part of the eval I left out is "can an LLM converge on an 'A'?" Say the LLM gives the code a 6/10 (or a B-). When you implement its suggestions and provide the improved code in a new context, does the grade go up? And can it eventually give the code an A, consistently? It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
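A minimal sketch of the stability check, assuming a hypothetical `call_llm()` stand-in for whatever chat API is in use (the prompt wording and the 1–10 scale are illustrative assumptions, not details from the comment above):

```python
import re
import statistics

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call (one fresh context per call)."""
    raise NotImplementedError("wire up your provider here")

def review_grade(code: str, critical: bool = False) -> int:
    """Ask the model for a 1-10 grade and parse it out of the reply."""
    tone = "with a critical eye " if critical else ""
    prompt = (
        f"Review this code {tone}and end your review with a line "
        f"'GRADE: n', where n is an integer from 1 to 10.\n\n{code}"
    )
    match = re.search(r"GRADE:\s*(\d+)", call_llm(prompt))
    return int(match.group(1))

def stability_eval(code: str, runs: int = 10) -> None:
    plain = [review_grade(code) for _ in range(runs)]
    critical = [review_grade(code, critical=True) for _ in range(runs)]
    # Low spread within each list and a small gap between the two means
    # are the "stable reviewer" signals described above.
    print("plain   :", statistics.mean(plain), "+/-", statistics.pstdev(plain))
    print("critical:", statistics.mean(critical), "+/-", statistics.pstdev(critical))
    print("delta   :", statistics.mean(critical) - statistics.mean(plain))
```

The convergence check from the edit would just loop on top of this: apply the model's suggestions, call `review_grade` on the improved code in a fresh context, and watch whether the sequence of grades climbs toward the top of the scale and stays there.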
lemming · 2 days ago
I agree. I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I'll often paste the problems from GPT5 into Claude, then get GPT5 to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It's very reassuring how consistent the results are.
adastra22 · 2 days ago
You mean literally assign a grade, like B+? That's unlikely to work, given how token prediction and temperature work: you end up with a probability distribution over the grade tokens that reflects the model's runtime sampling parameters, not the intelligence of the model.
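A toy illustration of that point, using made-up logits for a handful of grade tokens (the numbers are pure assumption, not from any real model): sampling the same fixed distribution at a nonzero temperature produces different grades run to run, so the variance comes from the sampler, not from any change in the model's judgment.

```python
import math
import random

random.seed(0)

# Hypothetical logits a model might assign to grade tokens for one fixed input.
LOGITS = {"A": 2.0, "B+": 2.3, "B": 1.1, "C": -0.5}

def sample_grade(temperature: float) -> str:
    """Softmax over the fixed logits at the given temperature, then sample."""
    weights = {g: math.exp(l / temperature) for g, l in LOGITS.items()}
    r = random.random() * sum(weights.values())
    for grade, w in weights.items():
        r -= w
        if r <= 0:
            return grade
    return grade  # guard against floating-point underrun

# Same input, same logits: the sampled grade still varies at T=1.0.
print([sample_grade(1.0) for _ in range(10)])
```

Temperature 0 (greedy decoding) would collapse this back to the single highest-logit grade, which is why grade variance alone is ambiguous between "unstable model" and "nonzero sampling temperature."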
OsrsNeedsf2P · 2 days ago
How is this different from testing the temperature?
guluarte · 2 days ago
My experience reviewing PRs is that sometimes it says the PR is perfect with some nitpicks, and other times it says the same PR is trash and needs a lot of work.