sigmar 8 hours ago

Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

gs17 8 hours ago | parent | next [-]

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

sigmar 7 hours ago | parent | next [-]

I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

edit: they just removed the reference to "3.1" from the pdf

josalhor 6 hours ago | parent [-]

I think this is 3.1 (3.0 Pro with the RL improvements of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.

WarmWash 6 hours ago | parent [-]

The Deep Think moniker is for parallel compute models though, not long CoT like pro models.

It's possible though that deep think 3 is running 3.1 models under the hood.

staticman2 7 hours ago | parent | prev | next [-]

That's odd considering 3.0 is still labeled a "preview" release.

ainch 3 hours ago | parent | next [-]

I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.

6 hours ago | parent | prev [-]
[deleted]
WarmWash 7 hours ago | parent | prev [-]

The rumor was that 3.1 was today's drop

losvedir 6 hours ago | parent [-]

Where are these rumors floating around?

beauzero 5 hours ago | parent [-]

One of many https://x.com/synthwavedd/status/2021983382314660075

riku_iki 7 hours ago | parent | prev [-]

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They'll never run it on the private set, because that would mean leaking the set to Google.