swid 4 days ago:
It surely is different. If you set the temperature to 0 and rerun the test with slightly different wording, there is no guarantee at all that the scores will be consistent. Conversely, an LLM can be consistent even at a high temperature: it could give the same PR the same grade while choosing different words around it. Tokens are still sampled from the distribution, so if the same grade token has a high enough probability, it will be chosen most of the time regardless of the temperature setting.
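For anyone curious about the mechanics, here is a toy Python sketch (made-up logits, not any real model or benchmark) showing how a grade token that dominates the distribution keeps getting picked even as the temperature rises:

    # Toy illustration: temperature-scaled sampling over hypothetical grade logits.
    import math
    import random

    def sample(logits, temperature, rng):
        """Sample a token index from temperature-scaled softmax logits."""
        if temperature == 0:
            # Greedy decoding: always pick the highest-logit token.
            return max(range(len(logits)), key=lambda i: logits[i])
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return rng.choices(range(len(logits)), weights=probs, k=1)[0]

    grades = ["A", "B", "C", "D", "F"]
    logits = [2.0, 6.0, 1.0, -1.0, -3.0]   # "B" strongly dominates (made up)

    rng = random.Random(0)
    for temp in (0, 0.7, 1.5):
        picks = [grades[sample(logits, temp, rng)] for _ in range(1000)]
        share_b = picks.count("B") / len(picks)
        print(f"temperature={temp}: 'B' chosen {share_b:.0%} of the time")

With those made-up logits, "B" comes out every time at temperature 0 and still the large majority of the time even at 1.5.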
smt88 3 days ago:
I think you're restating (in a longer and more accurate way) what I understood the original criticism to be: that this grading test isn't testing what it's supposed to, partly because a grade is too few tokens. The model could "assess" the code the same way qualitatively and still give slightly different letter grades.