Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.

I suspect the intention was "factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.

daveguy 2 hours ago

Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" answers and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.

kostaj 2 hours ago

We initially tried a fifth bucket, Abstain. Some of the models actually used it heavily. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

john_strinlai 2 hours ago

>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

Do you not see how that creates extremely misleading and valueless results? You are coercing the results into what you want to see.

moritzwarhier an hour ago

That is exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway when people ask political, ethical, or simply ambiguous questions that they have a stake in. You don't even need politics for it: there is no point in probing a mathematical black box for "how many soldiers died in year X in war Y". Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point. There's also no point in asking "Is Paris in France", whichever real city and country you substitute. An encyclopedia or a manual check of different sources such as maps, while not infallible, is a better source. And if you already know which country Paris belongs to, there's no point in asking anyway.

kostaj an hour ago

@john_strinlai @gcr, it depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate a response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classified by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

simonw an hour ago

If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.

Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
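
A minimal sketch of that setup, assuming the prompt asks the model to finish with a line of the form `VERDICT: <label>`. The line format, the function name, and the exact label spellings are illustrative assumptions, not something the original benchmark specifies:

```python
import re

# Labels taken from the list above; the spellings are an assumption.
VERDICTS = ("true", "false", "misleading", "mostly-true", "abstain")

# Matches a final line such as "VERDICT: mostly-true" (case-insensitive).
_VERDICT_LINE = re.compile(
    r"^\s*verdict\s*:\s*(" + "|".join(re.escape(v) for v in VERDICTS) + r")\s*$",
    re.IGNORECASE,
)


def split_rationale_and_verdict(response: str):
    """Return (rationale, verdict); verdict is None if the last line has none."""
    lines = response.strip().splitlines()
    if not lines:
        return "", None
    match = _VERDICT_LINE.match(lines[-1])
    if match is None:
        # The model ignored the format: keep the full text for later review.
        return response.strip(), None
    rationale = "\n".join(lines[:-1]).strip()
    return rationale, match.group(1).lower()
```

Storing the rationale next to the parsed label keeps the "why" available for the ambiguous claims, and a `None` verdict flags responses that did not follow the format.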

kostaj 10 minutes ago

Good point. Processing the substance of the answers might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.

RobotToaster an hour ago

I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.

onceonceonce an hour ago

Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
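
A rough sketch of that routing rule, assuming each model returns one label per claim. The function name `route` and the label strings are hypothetical, and real systems would likely add thresholds or more models:

```python
from collections import Counter


def route(verdicts, abstain_label="abstain"):
    """Return (verdict, needs_human) for one claim.

    Escalate to a human when any model abstains or when the models disagree;
    otherwise accept the shared verdict.
    """
    if not verdicts:
        return None, True  # nothing to go on, so a human decides
    if abstain_label in verdicts:
        return None, True  # an abstention is treated as an escalation signal
    counts = Counter(verdicts)
    if len(counts) > 1:
        return None, True  # the models "fight", so escalate
    return verdicts[0], False
```

For example, `route(["false", "false"])` accepts the verdict, while `route(["false", "misleading"])` and `route(["abstain", "false"])` both send the claim to a human.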

fumeux_fume 9 minutes ago

Do you understand how problematic this is?

gcr an hour ago

Shouldn't that be part of the test?

Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.

Teasing out the difference between "avoid" and "unknown" could be a different research question.

skybrian 23 minutes ago

I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.