dakolli 7 hours ago

This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcomes, because the point is to sell the technology's competency (and the perception thereof), not its capabilities, to a bunch of people who have no clue what they're talking about.

LLMs will ruin your product. Have fun trusting a billionaire's thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.

tedsanders 2 hours ago | parent | next [-]

We don't want hallucinations either, I promise you.

A few biased defenses:

- I'll note that this eval doesn't have web search enabled, but we train our models to use web search in ChatGPT, Codex, and our API. I'd be curious to see hallucination rates with web search on.

- This eval only measures a binary attempted-vs-did-not-attempt outcome, and doesn't really reward any sort of continuous hedging like "I think it's X, but to be honest I'm not sure."

- On the flip side, GPT-5.5 has the highest accuracy score.

- With any rate over 1% (whether 30% or 70%), you should be verifying anything important anyway.

- On our internal eval made from de-identified ChatGPT prompts that previously elicited hallucinations, we've actually been improving substantially from 5.2 to 5.4 to 5.5. So as always, progress depends on how you measure it.

- Models that ask more clarifying questions will do better on this eval, even if they are just as likely to hallucinate after the clarifying question.
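The scoring gap in the second bullet can be sketched in a few lines. This is a hypothetical illustration, not the eval's actual implementation: all function names, the `0.5` partial credit, and the response categories are assumptions made up for the example.

```python
# Hypothetical sketch: a binary "attempted vs. did not attempt" metric
# gives a hedged answer the same weight as a fully confident one, while
# a graded scheme could give it partial credit.

def binary_attempt_score(response_kind: str) -> float:
    # Binary scoring: any answer counts as an attempt; only abstentions don't.
    return 0.0 if response_kind == "abstain" else 1.0

def graded_attempt_score(response_kind: str) -> float:
    # Graded scoring (illustrative): hedged answers like "I think it's X,
    # but I'm not sure" earn partial credit instead of full weight.
    return {"abstain": 0.0, "hedged": 0.5, "confident": 1.0}[response_kind]

responses = ["confident", "hedged", "hedged", "abstain"]

binary_rate = sum(binary_attempt_score(r) for r in responses) / len(responses)
graded_rate = sum(graded_attempt_score(r) for r in responses) / len(responses)

print(binary_rate)  # 0.75 -- hedges count as full attempts
print(graded_rate)  # 0.5  -- hedges count half
```

Under binary scoring, a model that hedges honestly looks identical to one that answers with unwarranted confidence, which is the distortion the bullet points at.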

Still, Anthropic has done a great job here and I hope we catch up to them on this eval in the future.

calf 2 hours ago | parent | prev [-]

On a ChatGPT 5.3 Plus subscription, I find that long informal chats tend to reveal unsatisfactory answers and biases; at this point, after 10 rounds of replies, I end up having to correct it so much that it comes full circle and starts agreeing with my initial arguments. I don't see how this behavior is acceptable or safe for real work. Are programmers and engineers using LLMs completely differently than I am? Because the underlying technology is fundamentally the same.