silvertaza 6 hours ago
Still a huge hallucination rate, unfortunately, at 86%. For comparison, Opus sits at 36%. Source: https://artificialanalysis.ai/models?omniscience=omniscience...
dubcanada 5 hours ago
Grok is 17%? And that's the lowest, while most models are 80%+? Meanwhile, actual hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.
simianwords 6 hours ago
There's something off with this, because Haiku should not be that good.
dakolli 5 hours ago
This indicates they want this behavior. They know the person asking the question probably doesn't understand the problem entirely (or why would they be asking?), so they'd prefer a confident response, regardless of outcome, because the point is to sell the perception of the technology's competency, not its actual capabilities, to a bunch of people who have no clue what they're talking about. LLMs will ruin your product. Have fun trusting a billionaire's thinking machine that they swear is capable of replacing your employees if you just pay them 75% of your labor budget.