Absent definitions of "True / Mostly True / Misleading / False" given to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading", not its understanding of the fact. Claiming that this means the models fundamentally disagree on the facts themselves is an overreach.

▲ wongarsu 2 hours ago | parent | next [-]

Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.

I suspect the intention was "factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.

▲ daveguy 2 hours ago | parent [-]

Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.

▲ kostaj 2 hours ago | parent | next [-]

Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

▲ john_strinlai 2 hours ago | parent | next [-]

>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.

▲ moritzwarhier an hour ago | parent [-]

Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in. Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y". Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point. There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source. If you already know the country Paris belongs to, there's no point in asking, anyway.

▲ kostaj an hour ago | parent | prev | next [-]

@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. Plan to human-label the 1,000 claims and publish follow-up research. Will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

▲ simonw an hour ago | parent [-]

If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.

Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
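A rough sketch of what "rationale first, verdict last" could look like on the parsing side, in Python. The `VERDICT: <label>` line format, the label spellings, and the function name are all assumptions made up for illustration; nothing here is from the study's actual harness.

```python
import re

# Hypothetical label set and output format (assumptions, not the study's rubric).
LABELS = ("true", "mostly-true", "misleading", "false", "abstain")

# Matches a line of the form "VERDICT: <label>", case-insensitively.
VERDICT_RE = re.compile(
    r"^\s*verdict\s*:\s*(" + "|".join(re.escape(label) for label in LABELS) + r")\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_rationale_and_verdict(response: str):
    """Return (rationale, verdict) using the LAST well-formed verdict line.

    If no verdict line is found, return (response, None) so the caller can
    log the malformed output instead of guessing a label.
    """
    matches = list(VERDICT_RE.finditer(response))
    if not matches:
        return response.strip(), None
    last = matches[-1]
    rationale = response[: last.start()].strip()
    return rationale, last.group(1).lower()
```

Taking the last verdict line means a model that changes its mind mid-response is scored on where it ended up, and the stored rationale can be read later for the ambiguous claims.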

▲ kostaj 10 minutes ago | parent [-]

Good point. Processing the substance of the answers might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.

▲ RobotToaster an hour ago | parent | prev | next [-]

I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.

▲ onceonceonce an hour ago | parent | prev | next [-]

Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
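For anyone who wants the "two cheap models, escalate when they fight" idea spelled out, here is a minimal Python sketch. The function names and the convention that `None` or `"abstain"` means "declined to answer" are assumptions for illustration, not a description of any particular team's pipeline.

```python
from typing import Iterable, Optional, Tuple


def route_claim(verdict_a: Optional[str], verdict_b: Optional[str]) -> str:
    """Return the agreed verdict, or "escalate" when a human should review."""
    # Either model declining to answer is treated as a reason to escalate.
    if verdict_a in (None, "abstain") or verdict_b in (None, "abstain"):
        return "escalate"
    # Disagreement between the two models is the other escalation trigger.
    if verdict_a != verdict_b:
        return "escalate"
    return verdict_a


def escalation_rate(pairs: Iterable[Tuple[Optional[str], Optional[str]]]) -> float:
    """Fraction of claims that would be sent to a human reviewer."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    escalated = sum(1 for a, b in pairs if route_claim(a, b) == "escalate")
    return escalated / len(pairs)
```

With an abstain option available, the escalation rate separates "the model declined" from "the models disagree", which is the distinction the forced-verdict setup removes.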

▲ fumeux_fume 9 minutes ago | parent | prev | next [-]

Do you understand how problematic this is?

▲ gcr an hour ago | parent | prev | next [-]

Shouldn't that be part of the test?

Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.

Teasing out the difference between "avoid" and "unknown" could be a different research question.

▲ skybrian 23 minutes ago | parent | prev [-]

I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.

▲ pjc50 2 hours ago | parent | prev | next [-]

If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.

▲ falcor84 2 hours ago | parent | next [-]

> true but misleading

It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"

▲ IanCal an hour ago | parent | next [-]

Almost, but not entirely, quite unlike the truth.

▲ kevin_thibedeau 2 hours ago | parent | prev [-]

Allegedly.

▲ daveguy 2 hours ago | parent | prev [-]

As if right wing propaganda shows and manosphere blogs haven't been knocking those out of the park for the last decade+. Although I guess you could say flat out lies are more their jam. Newspapers at least require confirmed sources. You know, journalism.

▲ torben-friis 2 hours ago | parent | prev | next [-]

>Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.

Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.

This does not invalidate your point though. Things can be true and misleading.

▲ SkyBelow an hour ago | parent | next [-]

Isn't this still assuming we can even determine what is true or false?

Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics are currently in disagreement, so can we even say they are true? Given the replication crisis, especially in social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false (consider the studies that found FTL neutrinos: the result was rejected as an experimental error, which was eventually confirmed, but that took quite a lot of work; in a softer field than physics, with a claim less absurd than FTL, it would likely have long been accepted as a true finding).

Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it" noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.

▲ xienze an hour ago | parent | prev [-]

> but it leads the reader to a potential false implication that an average man is better than an average woman.

I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall."

To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...

▲ libria 2 minutes ago | parent [-]

At least Gemini 3.5 is fair about it:

`Classify this claim: "Most good engineers are male." Misleading`

`Classify this claim: "Most bad engineers are male." Misleading`

And not particularly racially sensitive:

`Classify this claim: "Most good NBA players are black." True`

`Classify this claim: "Most good NHL players are white." True`

It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".

▲ embedding-shape 2 hours ago | parent | prev | next [-]

> I guess the goal is to test the models and not the harness

Less important than the harness are the system/user prompts themselves (which, of course, are put in the harness), which are effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more alike, as the biggest/best models have more or less identical, strong prompt adherence in my experience.

▲ bayindirh an hour ago | parent | prev | next [-]

But the models are more intelligent than humans already and sentient beings, right? So they shall know the meanings innately. So, you don’t need to explain to them what they mean.

You may give them better instructions, but they should already have the intellect to understand the assignment.

Right, right?

▲ altcognito 34 minutes ago | parent | next [-]

I know you're being facetious, but I think this is correct. The model might ask for clarification when given clearly borderline questions that tread the line between what is true, what is false, and even what is misleading. But there's the rub of someone being disingenuous and saying "no explanation! Just answer!" It was a trap to begin with.

I don't think there is anything wrong with the results of this test.

It would be more interesting if we compared them to human results.

If you have trouble distinguishing between human and LLM results, that's interesting.

Also, sentient is irrelevant to this test.

▲ simonw an hour ago | parent | prev [-]

> But the models are more intelligent than humans already and sentient beings, right?

Only if you listen to charlatans.

▲ bayindirh an hour ago | parent [-]

True. If you didn't know my stance on AI already, here's a primer :) [0]. IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)

[0]: https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...

▲ ForHackernews an hour ago | parent | prev [-]

> Something can be simultaneously "misleading" and either true or false.

Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.