new | show | ask | jobs Github

camgunz 2 hours ago

> Specifically in the case where it can use tools - no it doesn't hallucinate.

OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:

- 0.7% in LongFact-Concepts

- 0.8% in LongFact-Objects

- 1.0% in FActScore

> Which is why you are struggling to find counterexamples.

Hey look, over 500 counterexamples: [1].

GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!

At some point you gotta face the music, right?

[0]: https://artificialanalysis.ai/evaluations/omniscience?model-...

[1]: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omnisc...

▲

simianwords 2 hours ago | parent [-]

You had to go all the way and find it in the benchmark results that specifically stress test this.

You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools when I specifically said that it should be able to use tools. I'm not sure why are you present this as though it is a big gotcha.

I think my main point pretty much stands.

▲

camgunz 2 hours ago | parent [-]

I found over 500 examples that fit your criteria. Embarrassing you were arguing in bad faith this whole time.

▲

simianwords 2 hours ago | parent [-]

They all use the tool search, no? Please correct me if I'm wrong.

My criteria was using ChatGPT which explicitly allows it.

https://arxiv.org/html/2511.13029v1 if you don't believe me.

BTW this was your original point

>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.

And look at how much effort you have had to do

1. use the wrong model for the horns example

2. the game one also didn't work

3. now you are searching for examples in literal benchmarks and you are still not able to find any

How is this trivial in any interpretation of the word?

I think it would be perfectly reasonable to agree that it is not at all trivial to find counter examples for my challenge.

	▲	camgunz an hour ago \| parent [-]
		I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it: > I found over 500 examples that fit your criteria.