kostaj 2 hours ago

Yes, disagreement between human annotators is also high on similar types of questions (AVeriTeC): inter-panel agreement is κ=0.619. I tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.