postalcoder · 2 days ago
One of my favorite personal evals for LLMs is testing their stability as a reviewer. The basic gist: give the LLM some code to review and have it assign a grade, multiple times. How much variance is there in the grade? Then prompt the same LLM to be a "critical" reviewer of the same code, again multiple times. How much does the average critical grade change? Low variance across many generations, plus a small delta between "review this code" and "review this code with a critical eye," is a major positive signal for quality.

I've found that gpt-5.1 produces remarkably stable evaluations, whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical, whereas gpt-5.1 stays directionally the same while tightening the screws. You could also interpret these results as a proxy for obsequiousness.

Edit: one major part of the eval I left out is "can an LLM converge on an 'A'?" Say the LLM gives the code a 6/10 (or a B-). When you implement its suggestions and provide the improved code in a new context, does the grade go up? And can it eventually give the code an A, consistently? It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
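A minimal sketch of the stability check, assuming a hypothetical `call_llm()` stand-in for whatever chat API is in use (the prompt wording and the 1–10 scale are illustrative assumptions, not details from the comment above):

```python
import re
import statistics

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call (one fresh context per call)."""
    raise NotImplementedError("wire up your provider here")

def review_grade(code: str, critical: bool = False) -> int:
    """Ask the model for a 1-10 grade and parse it out of the reply."""
    tone = "with a critical eye " if critical else ""
    prompt = (
        f"Review this code {tone}and end your review with a line "
        f"'GRADE: n', where n is an integer from 1 to 10.\n\n{code}"
    )
    match = re.search(r"GRADE:\s*(\d+)", call_llm(prompt))
    return int(match.group(1))

def stability_eval(code: str, runs: int = 10) -> None:
    plain = [review_grade(code) for _ in range(runs)]
    critical = [review_grade(code, critical=True) for _ in range(runs)]
    # Low spread within each list and a small gap between the two means
    # are the "stable reviewer" signals described above.
    print("plain   :", statistics.mean(plain), "+/-", statistics.pstdev(plain))
    print("critical:", statistics.mean(critical), "+/-", statistics.pstdev(critical))
    print("delta   :", statistics.mean(critical) - statistics.mean(plain))
```

The convergence check from the edit would just loop on top of this: apply the model's suggestions, call `review_grade` on the improved code in a fresh context, and watch whether the sequence of grades climbs toward the top of the scale and stays there.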
lemming · 2 days ago
I agree. I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I'll often paste the problems from GPT5 into Claude, then get GPT5 to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It's very reassuring how consistent the results are.
adastra22 · 2 days ago
You mean literally assign a grade, like B+? That's unlikely to work, given how token prediction and temperature work: you end up with a probability distribution over the grade tokens that reflects the model's runtime sampling parameters, not the intelligence of the model.
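A toy illustration of that point, using made-up logits for a handful of grade tokens (the numbers are pure assumption, not from any real model): sampling the same fixed distribution at a nonzero temperature produces different grades run to run, so the variance comes from the sampler, not from any change in the model's judgment.

```python
import math
import random

random.seed(0)

# Hypothetical logits a model might assign to grade tokens for one fixed input.
LOGITS = {"A": 2.0, "B+": 2.3, "B": 1.1, "C": -0.5}

def sample_grade(temperature: float) -> str:
    """Softmax over the fixed logits at the given temperature, then sample."""
    weights = {g: math.exp(l / temperature) for g, l in LOGITS.items()}
    r = random.random() * sum(weights.values())
    for grade, w in weights.items():
        r -= w
        if r <= 0:
            return grade
    return grade  # guard against floating-point underrun

# Same input, same logits: the sampled grade still varies at T=1.0.
print([sample_grade(1.0) for _ in range(10)])
```

Temperature 0 (greedy decoding) would collapse this back to the single highest-logit grade, which is why grade variance alone is ambiguous between "unstable model" and "nonzero sampling temperature."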
OsrsNeedsf2P · 2 days ago
How is this different from testing the temperature?
guluarte · 2 days ago
My experience reviewing PRs is that sometimes it says the PR is perfect with some nitpicks, and other times it says the same PR is trash and needs a lot of work.