| ▲ | kostaj 2 hours ago | |
This paper covers only the disagreement between models. That disagreement establishes a floor on the error, but it does not show which model is better. Planning to follow up with another study that benchmarks against human-labelled verdicts, still using a corpus that the models have not seen during training. | ||
| ▲ | aspenmartin 44 minutes ago | parent [-] | |
You also need better measures of agreement that are standard in the literature, like Krippendorff's alpha with an ordinal metric. There are so many footguns in this methodology. | ||
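For readers who have not met the measure, here is a minimal sketch of Krippendorff's alpha with the ordinal distance metric, in plain Python. It follows the standard coincidence-matrix formulation. The function name `krippendorff_alpha_ordinal` and the toy ratings are made up for illustration; nothing here comes from the paper under discussion, whose rating scale is not stated in this thread.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units):
    """Krippendorff's alpha with the ordinal metric (illustrative sketch).

    `units` is a list of items; each item is a list of the ordinal ratings
    it received, one per rater, with None marking a missing rating.
    """
    # Coincidence matrix: every ordered pair of ratings given to the same
    # item by different raters contributes 1 / (m - 1), where m is the
    # number of ratings that item received.
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two ratings carry no information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c per rating level, and the grand total n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    levels = sorted(totals)

    def delta2(c, k):
        # Ordinal distance: squared count of ratings lying between the two
        # levels (inclusive), minus half of the two end-point totals.
        lo, hi = sorted((c, k))
        between = sum(totals[g] for g in levels if lo <= g <= hi)
        return (between - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(w * delta2(a, b) for (a, b), w in coincidences.items())
    expected = sum(
        totals[c] * totals[k] * delta2(c, k) for c in levels for k in levels
    )
    if expected == 0:
        # Only one level was ever used, so alpha is undefined.
        return float("nan")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical verdicts on a 1-3 scale from three raters (or models).
perfect = [[1, 1, None], [2, 2, 2], [3, None, None]]
opposed = [[1, 2], [1, 2]]
print(krippendorff_alpha_ordinal(perfect))
print(krippendorff_alpha_ordinal(opposed))
```

Alpha is 1 when raters agree perfectly, about 0 when agreement is what chance would produce, and negative when raters disagree systematically. Unlike raw percent agreement, it handles missing ratings and penalises a 1-versus-3 disagreement more than a 1-versus-2 one.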