simianwords 2 hours ago

I agree that they hallucinate sometimes. I agree they bullshit sometimes. But the extent is way overblown. They basically never bullshit under the constraints of

1. 2-3 pages of text context

2. GPT-5.4 thinking

I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.

camgunz an hour ago | parent [-]

> I don't think the spirit of the original article (not your comments to be fair) captured this, hence the challenge. I believe we are on the same page here.

No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criteria of "2-3 pages of text context." Unless 5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.

[0]: https://arxiv.org/pdf/2601.03267

[1]: https://github.com/openai/simple-evals

simianwords an hour ago | parent [-]

Specifically in the case where it can use tools - no it doesn't hallucinate. Which is why you are struggling to find counterexamples.

camgunz 19 minutes ago | parent [-]

> Specifically in the case where it can use tools - no it doesn't hallucinate.

OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:

- 0.7% in LongFact-Concepts

- 0.8% in LongFact-Objects

- 1.0% in FActScore

> Which is why you are struggling to find counterexamples.

Hey look, over 500 counterexamples: [1].

GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!

At some point you gotta face the music, right?

[0]: https://artificialanalysis.ai/evaluations/omniscience?model-...

[1]: https://huggingface.co/datasets/ArtificialAnalysis/AA-Omnisc...

simianwords 12 minutes ago | parent [-]

You had to go all the way to benchmark results that specifically stress-test this.

You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you present this as though it's a big gotcha.

I think my main point pretty much stands.

camgunz 10 minutes ago | parent [-]

I found over 500 examples that fit your criteria. It's embarrassing that you were arguing in bad faith this whole time.

simianwords 9 minutes ago | parent [-]

They all use the search tool, no? Please correct me if I'm wrong.

My criterion was using ChatGPT, which explicitly allows it.

https://arxiv.org/html/2511.13029v1 if you don't believe me.

BTW this was your original point

>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.

And look at how much effort it has taken:

1. You used the wrong model for the horns example.

2. The game example also didn't work.

3. Now you're searching for examples in literal benchmarks and still can't find any.

How is this trivial in any interpretation of the word?

I think it would be perfectly reasonable to agree that it is not at all trivial to find counterexamples for my challenge.