This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, they only have so much room for "information storage" in them, and there are a lot more things they need to fit into their numbers.

This test is of only marginal utility in the real world compared to testing an AI with access to the web. While I wouldn't expect web access to yield Platonic Truth in the hands of an AI any more than in the hands of a human, it would probably get a lot closer to something humanlike.

I recall that about a year ago we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply querying AIs directly or turning them loose on the web. The former is what this is testing, and it is fairly transparently stupid, just by the information-theoretic argument that the AIs simply can't contain the answers to every query; they're just not large enough (and really can't be, practically). I've had good results with the latter when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find is often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. The result isn't perfect, but I wouldn't expect a human to do better in a casual overview.

vstollen 20 minutes ago | parent | next [-]

Can you share what you mean by this?

> when using dedicated AI resources that I'm paying for

Are there API-based search providers that structure their results differently?

afavour 2 hours ago | parent | prev | next [-]

While I agree with what you’re saying, the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?”. It often just spits out a reply based on its knowledge.

simonw 2 hours ago | parent [-]

That was true a year ago, but I don't think it's true today. I can't remember the last time I saw Claude or ChatGPT confidently answer a question that they should have searched for instead.

If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they fire off a bunch of searches.

kostaj 2 hours ago | parent | prev [-]

Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.
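For anyone wanting to reproduce a figure like this, here is a minimal sketch of a pairwise disagreement rate, assuming "disagreement" means the two models assigned different labels to the same claim. The function name and the sample labels are hypothetical and are not taken from the study's data.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of claims on which two models assigned different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    if not labels_a:
        raise ValueError("need at least one claim")
    # Count positions where the two labels are not exactly equal.
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return differing / len(labels_a)


# Hypothetical labels for five claims (illustrative only, not the study's data).
gemini_labels = ["Misleading", "True", "False", "Mostly True", "True"]
sonar_labels = ["Mostly True", "True", "False", "Misleading", "True"]

rate = disagreement_rate(gemini_labels, sonar_labels)
print(f"{rate:.0%}")
```

Exact-match comparison is the strictest choice: "Misleading" versus "Mostly True" counts the same as "True" versus "False", so a rate computed this way says nothing about how far apart the two labels are.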

simonw 2 hours ago | parent [-]

Here are those disagreements:

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

One example:

> Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.

Gemini retrieval: Misleading

Sonar Pro: Mostly True

jeffbee an hour ago | parent [-]

Internally the statement is perfectly true: some researchers did estimate this, and the credit card is a fair proxy for a 5g mass. Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.