aspenmartin an hour ago
No, it’s this: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done under careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for:

- Evaluations need to be done at the same time to avoid drift in your bias.
- You need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?
- Which one did you do first? Raters have a tendency to bias in one direction or another.
- You also know the label! You know which model is which! This biases your assessment…

And on and on and on. Careful science exists for a reason.
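To make two of those points concrete (order effects and knowing the label), here is a rough sketch of a blinded pairwise comparison. It is only an illustration under my own assumptions: the function names, the "A"/"B" labels, and the trial layout are all made up, and it does nothing about test-set representativeness or drift over time.

```python
import random

def make_blinded_trials(prompts, outputs_a, outputs_b, seed=0):
    """Build pairwise trials with hidden labels and randomized order.

    "A" and "B" are hypothetical labels for the two models being compared.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        pair = [("A", a), ("B", b)]
        rng.shuffle(pair)  # randomize which model is shown first
        trials.append({
            "prompt": prompt,
            "first": pair[0][1],
            "second": pair[1][1],
            "key": (pair[0][0], pair[1][0]),  # answer key, never shown to the rater
        })
    rng.shuffle(trials)  # randomize question order as well
    return trials

def rater_view(trial):
    """Return only what the rater is allowed to see (no model labels)."""
    return {k: trial[k] for k in ("prompt", "first", "second")}

def unblind(trials, choices):
    """Tally wins per model; choices[i] is 0 (first shown) or 1 (second shown)."""
    wins = {"A": 0, "B": 0}
    for trial, choice in zip(trials, choices):
        wins[trial["key"][choice]] += 1
    return wins
```

You would show raters `rater_view(t)` for each trial, collect a 0 or 1 per trial, and only call `unblind` once all ratings are in. Deciding the question set and the scoring rule before looking at any outputs is the preregistration part, and that is not something code can do for you.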