tasuki 4 hours ago
So ARC-AGI was released in 2019. That's been solved, then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?
muskstinks 4 hours ago
This is clear AGI progress. It should show you that AI is not sleeping; it keeps getting better, and you should take this as a signal to take the topic seriously.
gordonhart 4 hours ago
The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!
futureshock 3 hours ago
Well yes, that is exactly the point! The very purpose of the ARC-AGI benchmarks is to find a pure reasoning task that humans are very good at and AI is very bad at. Companies then race each other to get a high score on that benchmark. Sure, there's going to be a lot of "studying for the test" and benchmaxing, but once a benchmark gets close to being saturated, ARC releases a new benchmark built on a new task the AI is terrible at. This will rinse and repeat until ARC can no longer find a reasoning task that a human can do but AI cannot. At that point we will effectively have AGI. I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.
didibus 3 hours ago
It gives the model makers a harness to optimize for in their next model version. They'll specifically work to pass the next version of ARC-AGI by evaluating what kind of dataset is missing that, if trained on, would let their model pass the new version. Ideally they don't train on ARC-AGI itself, but they can train on similar problems/datasets in the hope of learning skills that then transfer to solving the real ARC-AGI. The point is that a new version of ARC-AGI should help the next model be smarter.
tibbar 4 hours ago
The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.
minimaxir 4 hours ago
It's semver.
refulgentis 3 hours ago
You're absolutely right to point it out. LLMs weren't supposed to solve 1, but they did, so we got 2, and it really wasn't supposed to be solvable by LLMs. It was, and as soon as the scores started creeping up we started hearing about 3: It's Really AGI This Time.

I don't know what Francois' underlying story is, other than that he hasn't told it yet. One of the moments that confirmed it for me was when he was Just Asking Questions a month ago re: whether Anthropic still used SaaS, which was an odd conflation of a hyperbolic reading of a hyperbolic stonk market bro narrative (SaaS is dead), low info on LLMs (Claude's not the only one that can code), and addressing the wrong audience (if you follow Francois, you're likely neither of those poles).

At this point I'd be more interested in a write-up from Francois about where he is intellectually than in an LLM that got 100% on this. It's like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You're brilliant. But there's a translation gap between Mount Olympus and us plebes, and you're brilliant enough to know that too. So it comes across as trolling and boring.