kostaj 2 hours ago
Indeed. I prompted each model once, plus one retry on errors. Very good point about measuring inter-model disagreement! I will add it in the next version. Section "4.2 Agreement w/ peer majority" shows how often each model agrees with the majority. Yes, I am planning to have humans label the same corpus of 1,000 claims and to publish a second study measuring the models' performance against the human labels, on a corpus that the models have not seen during training.