kostaj 2 hours ago
We initially tried a fifth bucket, Abstain. Some of the models actually used it heavily. But it felt as if they were using it to "avoid" some of the hard questions, so we dropped the bucket to force them to provide a verdict.
john_strinlai 2 hours ago
> But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.
kostaj an hour ago
@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/responses anyway. Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classified by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.
RobotToaster an hour ago
I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.
onceonceonce an hour ago
Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
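For what it's worth, here is a minimal sketch of that "two cheap models, escalate when they fight" routing. It assumes each model returns a single bucket label as a string. The names `route` and `escalation_rate` and the label strings are made up for illustration and do not come from the benchmark.

```python
from typing import Iterable, Optional, Tuple

# "abstain" mirrors the fifth bucket mentioned upthread; any other
# label strings passed in are hypothetical placeholders.
ABSTAIN = "abstain"


def route(verdict_a: str, verdict_b: str) -> Tuple[str, Optional[str]]:
    """Accept a verdict only when both models agree on a non-abstain label.

    Returns ("accept", label) on agreement, otherwise ("escalate", None),
    meaning the claim should go to a human reviewer.
    """
    a = verdict_a.strip().lower()
    b = verdict_b.strip().lower()
    if ABSTAIN in (a, b):
        # Either model declined to answer: treat as a review signal.
        return ("escalate", None)
    if a != b:
        # The two models disagree: same signal as an abstain.
        return ("escalate", None)
    return ("accept", a)


def escalation_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of claims that would be sent to human review."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    escalated = sum(1 for a, b in pairs if route(a, b)[0] == "escalate")
    return escalated / len(pairs)
```

Whether a disagreement should count the same as an abstain is a design choice; this sketch treats both as "send to a human" and nothing more.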
fumeux_fume 7 minutes ago
Do you understand how problematic this is?
gcr an hour ago
Shouldn't that be part of the test? Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that. Teasing out the difference between "avoid" and "unknown" could be a different research question.