| ▲ | tylervigen 5 hours ago |
| I am personally impressed by the continued improvement on ARC-AGI-2, where Gemini 3 got 31.1% (vs. GPT-5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles). The fact that these models can keep getting better at this task, given how they are trained, is mind-boggling to me. The ARC puzzles in question: https://arcprize.org/arc-agi/2/ |
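For context on what these puzzles look like: each public ARC task is a small JSON file with a "train" list of demonstration input/output grids and a "test" list to solve, where a grid is a 2D array of integers 0-9 standing for colors. A minimal sketch for loading and inspecting one (the file path and helper names are illustrative, not part of any official tooling):

    import json

    def load_task(path: str) -> dict:
        """Load one ARC task from a local JSON file."""
        with open(path) as f:
            return json.load(f)

    def describe(task: dict) -> None:
        """Print the grid dimensions of each demonstration and test pair."""
        for split in ("train", "test"):
            for i, pair in enumerate(task.get(split, [])):
                r_in, c_in = len(pair["input"]), len(pair["input"][0])
                r_out, c_out = len(pair["output"]), len(pair["output"][0])
                print(f"{split}[{i}]: {r_in}x{c_in} -> {r_out}x{c_out}")

    # Usage (assumes a task file downloaded from the public ARC repository):
    # describe(load_task("example_task.json"))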
|
| ▲ | stephc_int13 4 hours ago | parent | next [-] |
| What I would do if I were in the position of a large company in this space is assemble an internal team to create an ARC replica covering very similar puzzles, and use that as part of the training. Ultimately, most benchmarks can be gamed, so their real utility is short-lived. But I also think it is fair to use any means to beat them. |
| |
▲ | tylervigen 4 hours ago | parent | next [-] | | I agree that for any given test, you could build a specific pipeline to optimize for that test. I suppose that's why it is helpful to have many tests. However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training. | | |
▲ | stephc_int13 an hour ago | parent [-] | | The real strength of current neural nets/transformers comes from huge datasets. ARC does not provide this kind of dataset, only a small public one and a private one used to run the benchmark. Building your own large private ARC set does not seem too difficult if you have enough resources. | | |
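As a rough sketch of what mass-producing ARC-style training pairs could look like (all names here are hypothetical, and the hidden rule is deliberately trivial compared to real hand-designed ARC tasks):

    import random

    COLORS = list(range(10))  # ARC-style grids use integer "colors" 0-9

    def random_grid(rows: int, cols: int) -> list[list[int]]:
        return [[random.choice(COLORS) for _ in range(cols)] for _ in range(rows)]

    def flip_horizontal(grid):
        return [list(reversed(row)) for row in grid]

    def remap_colors(grid, mapping):
        return [[mapping[c] for c in row] for row in grid]

    def make_task(n_train: int = 3) -> dict:
        """Generate one ARC-like task: a few demonstration pairs plus a held-out
        test pair, all governed by the same hidden rule (here, a fixed color
        permutation followed by a horizontal flip)."""
        perm = dict(zip(COLORS, random.sample(COLORS, len(COLORS))))
        rule = lambda g: flip_horizontal(remap_colors(g, perm))
        pairs = []
        for _ in range(n_train + 1):
            g = random_grid(random.randint(3, 8), random.randint(3, 8))
            pairs.append({"input": g, "output": rule(g)})
        return {"train": pairs[:n_train], "test": pairs[n_train:]}

    task = make_task()
    print(task["train"][0]["input"])
    print(task["train"][0]["output"])

Of course, rules you can generate this cheaply are far simpler than the hand-designed ARC-AGI-2 tasks, which may be part of why the benchmark has been hard to game, per the comment above.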
| |
▲ | AstroBen an hour ago | parent | prev | next [-] | | Is "good at benchmarks instead of real-world tasks" really something to optimize for? What does this achieve? Surely people would be initially impressed, try it out, be underwhelmed, and then move on. That's not great for Google. | | |
▲ | stephc_int13 an hour ago | parent | next [-] | | Benchmarks are intended as a proxy for real usage, and they are often useful for incrementally improving a system, especially when the end goal is not well defined. The trick is not to put more value in the score than it deserves. | |
▲ | spprashant an hour ago | parent | prev [-] | | Initial impressions are currently worth a lot. In the long run I think the moat will dissolve, but right now it's a race to lock users in to your model and make switching costs high. |
| |
| ▲ | simpsond 4 hours ago | parent | prev [-] | | Humans study for tests. They just tend to forget. |
|
|
| ▲ | grantpitt 5 hours ago | parent | prev | next [-] |
Agreed, it also leads in performance on ARC-AGI-1. Here's the leaderboard, where you can toggle between ARC-AGI-1 and 2: https://arcprize.org/leaderboard
|
| ▲ | tylervigen 3 hours ago | parent | prev | next [-] |
| This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3 |
|
| ▲ | HarHarVeryFunny 2 hours ago | parent | prev [-] |
There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise.