The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved".

> Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

gs17 8 hours ago

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

sigmar 7 hours ago

I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

Edit: they just removed the reference to "3.1" from the PDF.

josalhor 6 hours ago

I think this is 3.1 (3.0 Pro with the RL improvements of 3.0 Flash). But they probably decided to market it as Deep Think, because why not charge more for it?

	WarmWash 6 hours ago
		The Deep Think moniker is for parallel-compute models though, not long-CoT models like the Pro ones. It's possible, though, that Deep Think 3 is running 3.1 models under the hood.

staticman2 7 hours ago

That's odd considering 3.0 is still labeled a "preview" release.

	ainch 3 hours ago
		I think it'll be 3.1 by the time it's labelled GA - they said after the 3.0 launch that they had figured out new RL methods for Flash that the Pro model hasn't benefitted from.

WarmWash 7 hours ago

The rumor was that 3.1 was today's drop.

losvedir 6 hours ago

Where are these rumors floating around?

	beauzero 5 hours ago
		One of many: https://x.com/synthwavedd/status/2021983382314660075

riku_iki 7 hours ago

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They will never run it on the private set, because that would mean the set being leaked to Google.