| ▲ | pythonaut_16 2 hours ago | |
Seems like a bunch of noise. What does this even mean? It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8" | ||
| ▲ | aspenmartin an hour ago | parent [-] | |
No, it’s this: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done under careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for. - evaluations need to be done at the same time to avoid drift in your bias - you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work? - which one did you do first? Raters have a tendency to bias in one direction or another - you also know the label! You know which model is which! This biases your assessment… And on and on and on. Careful science exists for a reason. | ||
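To make the points about labels and ordering concrete, here is a minimal sketch of a blinded pairwise comparison in Python. Everything in it is an assumption for illustration: the function names `make_blinded_trials` and `unblind`, the placeholder model labels "A" and "B", and the idea that you already have one response per model for each prompt in a fixed test set. It only covers hiding the label and randomizing order; it does nothing about test-set size or representativeness.

```python
import random


def make_blinded_trials(prompts, outputs_a, outputs_b, seed=0):
    """Build blinded, order-randomized pairwise trials (hypothetical helper).

    The rater sees only "first" and "second"; the mapping back to the
    model labels is stored in "key" and must be kept hidden until scoring.
    """
    rng = random.Random(seed)  # fixed seed so the trial sheet is reproducible
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        pair = [("A", a), ("B", b)]
        rng.shuffle(pair)  # randomize which model is shown first
        trials.append({
            "prompt": prompt,
            "first": pair[0][1],
            "second": pair[1][1],
            "key": (pair[0][0], pair[1][0]),  # hidden from the rater
        })
    rng.shuffle(trials)  # randomize question order as well
    return trials


def unblind(trials, choices):
    """Map rater choices ("first" or "second") back to model labels."""
    wins = {"A": 0, "B": 0}
    for trial, choice in zip(trials, choices):
        index = 0 if choice == "first" else 1
        wins[trial["key"][index]] += 1
    return wins
```

In use, you would show the rater only `prompt`, `first`, and `second` for each trial, collect all choices in one sitting, and call `unblind` at the end. Fixing the prompt list and the seed before looking at any output is a rough stand-in for preregistration.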