kingstnap 5 days ago

It has a SimpleQA score of 69%. SimpleQA is a benchmark that tests knowledge of extremely niche facts, and that score is ridiculously high (Gemini 2.5 *Pro* scored 55%). It reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.

I'm speculating, but Google might have figured out some training trick to balance information storage against model capacity. That, or this Flash model has a huge number of parameters or something.

scrollop 5 days ago | parent | next [-]

Also

https://artificialanalysis.ai/evaluations/omniscience

Prepare to be amazed

albumen 5 days ago | parent | next [-]

I’m amazed by how much Gemini 3 Flash hallucinates; it performs poorly on that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), Opus 4.5, and Haiku 4.5 are.

Can someone explain how Gemini 3 Pro/Flash then do so well in the overall Omniscience: Knowledge and Hallucination benchmark?

wasabi991011 4 days ago | parent | next [-]

Hallucination rate is hallucination / (hallucination + partial + ignored), while the Omniscience Index is correct - hallucination.

One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure it is also more likely to be correct. This is consistent with it having the best accuracy score.
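
A rough sketch of how those two metrics interact, assuming the categories are simply counted per question (the counts and variable names below are made up, not the benchmark's real data):

    # hypothetical outcome counts for one model over 100 questions
    correct = 60
    hallucinated = 25   # confidently wrong
    partial = 5
    ignored = 10        # declined to answer

    hallucination_rate = hallucinated / (hallucinated + partial + ignored)  # 25/40 = 62.5%
    omniscience_index = correct - hallucinated                              # 60 - 25 = 35

A model that rarely abstains moves questions out of "ignored" and into either "correct" or "hallucinated", so its hallucination rate can climb even while correct - hallucinated improves.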

Wyverald 5 days ago | parent | prev [-]

I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.

> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant

This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the other models), it's not going to be in the most desirable quadrant by definition.

For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
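
Plugging that example into the formulas from the sibling comment (toy arithmetic, not the benchmark's exact scoring):

    correct, hallucinated, partial, ignored = 99, 1, 0, 0

    hallucination_rate = hallucinated / (hallucinated + partial + ignored)  # 1/1 = 100%
    omniscience_index = correct - hallucinated                              # 99 - 1 = 98

So a nearly perfect model that never abstains can still land in the "high hallucination rate" half of that chart.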

andy12_ 4 days ago | parent | prev [-]

I'm confused about the "Accuracy vs Cost" section. Why is Gemini 3 Pro so cheap? It's basically the cheapest model in the graph (sans Llama 4 and Mistral Large 3) by a wide margin, even compared to Gemini 3 Flash. Is that an error?

noelsusman 4 days ago | parent [-]

It's not an error; Gemini 3 Pro is just somehow able to complete the benchmark while using way fewer tokens than any other model. Gemini 3 Flash is way cheaper per token, but it also tends to generate a ton of reasoning tokens to get to its answer.

They have a similar chart that compares results across all their benchmarks vs. cost, and 3 Flash is only about half as expensive as 3 Pro there, despite being four times cheaper per token.
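
Rough illustration of how that can happen (the per-token prices and token counts below are invented, purely to show the arithmetic):

    pro_price_per_mtok = 12.0    # assume Pro is ~4x the per-token price of Flash
    flash_price_per_mtok = 3.0

    pro_tokens_m = 1.0           # Pro reaches its answers with few reasoning tokens
    flash_tokens_m = 2.0         # Flash burns roughly twice as many tokens

    pro_cost = pro_price_per_mtok * pro_tokens_m        # 12.0
    flash_cost = flash_price_per_mtok * flash_tokens_m  # 6.0 -> only ~half of Pro's cost

So a 4x per-token discount shrinks to a ~2x total-cost advantage once the extra reasoning tokens are paid for.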

int_19h 5 days ago | parent | prev | next [-]

> reflects either training on the test set or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model

That's what MoE (mixture of experts) is for. It might be that with their TPUs, they can afford lots of params, as long as the activated subset for each token is small enough to maintain throughput.
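
Toy sketch of the idea (generic top-k routing, nothing to do with Gemini's actual architecture):

    import numpy as np

    d_model, n_experts, top_k = 64, 32, 2
    rng = np.random.default_rng(0)
    router = rng.standard_normal((d_model, n_experts)) * 0.02
    experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02  # one weight matrix per expert

    def moe_forward(x):
        logits = x @ router
        top = np.argsort(logits)[-top_k:]                 # only the k best-scoring experts run
        w = np.exp(logits[top]) / np.exp(logits[top]).sum()
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

    moe_forward(rng.standard_normal(d_model))
    print(experts.size, top_k * d_model * d_model)  # total params grow with n_experts; active params don't

Total parameter count (and thus stored knowledge) scales with the number of experts, while per-token compute only scales with top_k.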

tanh 5 days ago | parent | prev | next [-]

This will be fantastic for voice. I presume Apple will use it.

GaggiX 5 days ago | parent | prev | next [-]

>or some sort of cracked way to pack a ton of parametric knowledge into a Flash Model.

More experts with a lower percentage of active ones -> more sparsity.
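
Back-of-the-envelope with made-up numbers:

    experts_small, experts_big, active = 64, 256, 8
    print(active / experts_small, active / experts_big)  # 12.5% vs ~3.1% of experts active per token

Quadruple the experts and total capacity quadruples, while per-token compute stays flat.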

leumon 5 days ago | parent | prev [-]

Or could it be that it's using tool calls during reasoning (e.g., a Google search)?