spoaceman7777 10 hours ago

Wow. They must have had some major breakthrough. Those scores are truly insane. O_O

Models have begun to fairly thoroughly saturate "knowledge" benchmarks and the like, though there are still considerable bumps there.

But the _big news_, and the demonstration of their achievement here, is the incredible set of scores they've racked up on what's necessary for agentic AI to become widely deployable: τ²-bench, visual comprehension, computer use, Vending-Bench. The sorts of capabilities AI needs to move beyond an auto-researching tool and into the realm where it can actually handle complex tasks in the way businesses need in order to reap rewards from deploying the tech.

Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.

And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD

rvnx 9 hours ago | parent | next [-]

The problem is that the benchmark is known in advance. Take Humanity's Last Exam, for example: it's way easier to optimize your model when you have seen the questions before.

pinko 7 hours ago | parent | next [-]

From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, _while maintaining a private test set of held out questions_ to assess model overfitting." [emphasis mine]

While the private questions don't seem to be included in the published performance results, HLE will presumably flag any LLM that appears to have gamed its scores, based on the differential performance between the public and private questions. Since they haven't flagged anyone yet, I think the scores are relatively trustworthy.
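Roughly, the check boils down to comparing public-set and held-out accuracy and flagging a large gap. A minimal sketch of that idea (the function names and the 5-point threshold are my own assumptions, not HLE's published methodology):

```python
# Minimal sketch of the overfitting check the HLE maintainers could run:
# compare a model's accuracy on the public questions with its accuracy on
# the private held-out questions, and flag a suspiciously large gap.
# The function names and the 5-point threshold are illustrative assumptions,
# not HLE's actual methodology.

def accuracy(results: list[bool]) -> float:
    return sum(results) / len(results)

def flag_overfitting(public_results: list[bool],
                     private_results: list[bool],
                     max_gap: float = 0.05) -> bool:
    """True if the public-set score exceeds the held-out score by more than
    max_gap, suggesting the model may have trained on the released questions."""
    return accuracy(public_results) - accuracy(private_results) > max_gap

# e.g. 38% on the public set vs 30% on the held-out set -> flagged
public = [True] * 38 + [False] * 62
private = [True] * 30 + [False] * 70
print(flag_overfitting(public, private))  # True
```

The interesting signal is the gap between the two scores, not either number on its own.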

panarky 6 hours ago | parent | next [-]

The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic data self-play??

This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.

If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.
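Pure speculation, but a reasoning self-play loop could be as simple as: the model proposes problems with verifiable answers, attempts them, and only the verified (problem, answer) pairs feed the next round of training. A toy sketch (every name and number here is hypothetical, not anything Google has disclosed about Gemini 3):

```python
# Toy sketch of a self-play loop for synthetic reasoning data: the model
# proposes problems with checkable answers, attempts them, and only verified
# (problem, answer) pairs are kept for the next round of training.
# Everything here (names, the 70% solve rate) is a made-up illustration.
import random

def generate_problem() -> tuple[str, int]:
    """Propose a toy problem whose ground truth can be verified exactly."""
    a, b = random.randint(1, 99), random.randint(1, 99)
    return f"{a} + {b}", a + b

def attempt_solution(problem: str) -> int:
    """Stand-in for sampling the model's answer; wrong ~30% of the time."""
    a, b = (int(x) for x in problem.split(" + "))
    return a + b if random.random() < 0.7 else a + b + 1

def self_play_round(n_problems: int = 1000) -> list[tuple[str, int]]:
    """Keep only pairs that pass verification; these become new training data."""
    dataset = []
    for _ in range(n_problems):
        problem, truth = generate_problem()
        answer = attempt_solution(problem)
        if answer == truth:  # the verifier filters out faulty reasoning
            dataset.append((problem, answer))
    return dataset

print(len(self_play_round()))  # roughly 700 verified examples per round
```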

largbae 4 hours ago | parent | prev | next [-]

How do they hold back questions in practice though? These are hosted models. To ask the question is to reveal it to the model team.

Bombthecat 4 hours ago | parent [-]

They pinky swear not to store and use the prompts and data lol

UltraSane 4 hours ago | parent [-]

A legally binding pinky swear LOL

riku_iki 29 minutes ago | parent [-]

with fine print somewhere on page #67 saying that there are exceptions.

UltraSane 4 hours ago | parent | prev | next [-]

You have to trust that the LLM provider isn't copying the questions when Humanity's Last Exam runs the test.

rvnx 5 hours ago | parent | prev [-]

Seems difficult to believe, considering how many of the people who prepared this dataset also work(ed) at, or hold shares in, Google, OpenAI, etc.

lubujackson 2 hours ago | parent | prev | next [-]

I don't think any of these companies are so reductive and short-sighted as to try to game the system. However, Goodhart's Law comes into play. I'm sure they have their own metrics that are much more detailed than these benchmarks, but the fact remains that LLMs will be tuned according to whatever is deterministically measurable.

stego-tech 9 hours ago | parent | prev | next [-]

This. A lot of boosters point to benchmarks as justification of their claims, but any gamer who spent time in the benchmark trenches will know full well that vendors game known tests for better scores, and that said scores aren’t necessarily indicative of superior performance. There’s not a doubt in my mind that AI companies are doing the same.

Feuilles_Mortes 9 hours ago | parent | prev | next [-]

Shouldn't we expect that all of the companies are doing this optimization, though? So we're back to a level playing field.

eldenring 7 hours ago | parent | prev [-]

It's the other way around too: HLE questions were selected adversarially to reduce the scores. I'd guess that even if the questions had never been released, the scores would still improve as new training data was introduced.

m3kw9 2 hours ago | parent | prev [-]

SWE-Bench Verified | 76.2% | 59.6% | 77.2% | 76.3% is actually insane.