▲ nubg 5 hours ago
Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago? And wasn't it considered the final boss, i.e. solve this and it's almost AGI-like? I ask because I can't tell all the benchmarks apart from memory.
|
▲ modeless 4 hours ago
François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the right direction, not as an indicator of having reached the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4. His definition of reaching AGI, as I understand it, is the point at which it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
▲ beklein 3 hours ago

https://x.com/fchollet/status/2022036543582638517
▲ joelthelion 2 hours ago

Do Opus 4.6 or Gemini Deep Think really use test-time adaptation? How does it work in practice?
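For context, "test-time adaptation" in the ARC literature usually means something like test-time training: briefly fine-tuning on a task's own demonstration pairs at inference, an approach used by several past ARC Prize entrants. Whether Opus 4.6 or Gemini Deep Think do anything like this has not been disclosed. A schematic PyTorch sketch, with the loss, hyperparameters, and function shape invented for illustration:

    import copy
    import torch
    import torch.nn as nn

    def solve_with_ttt(model, train_pairs, test_input, steps=8, lr=1e-4):
        # Test-time training: briefly fine-tune a throwaway copy of the model
        # on this task's own demonstration pairs, predict, then discard it.
        # The loss and hyperparameters here are placeholders, not any lab's
        # actual recipe.
        adapted = copy.deepcopy(model)
        opt = torch.optim.SGD(adapted.parameters(), lr=lr)
        for _ in range(steps):
            for x, y in train_pairs:
                opt.zero_grad()
                loss = nn.functional.mse_loss(adapted(x), y)
                loss.backward()
                opt.step()
        with torch.no_grad():
            return adapted(test_input)

Real systems operate on serialized grids with more elaborate objectives; the point is only the shape of the loop: adapt on the task's examples, answer, then reset.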
▲ mapontosevenths 2 hours ago

> His definition of reaching AGI, as I understand it, is the point at which it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

That is the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask for nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.

Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
▲ estearum 2 hours ago

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

This is not a good test. A dog won't claim to be conscious but clearly is, despite you not being able to prove it one way or the other. GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove it one way or the other.
▲ dullcrisp an hour ago

An LLM will claim whatever you tell it to claim. (In fact, this Hacker News comment is also conscious.) A dog won't even claim to be a good boy.
▲ WarmWash 2 hours ago

> because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

"Answer 'I don't know' if you don't know the answer to one of the questions."
▲ mrandish 7 minutes ago

I've been surprised how difficult it is for LLMs to simply answer "I don't know." It also seems oddly difficult for them to right-size the length and depth of their answers based on prior context: I either have to give a fixed length limit or put up with exhaustive answers.
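A minimal sketch of the kind of explicit instruction being described here, using the OpenAI Python SDK; the model name, trick question, and prompt wording are illustrative assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Explicitly license abstention and cap the answer length, since models
    # rarely do either unprompted. Model choice and wording are illustrative.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "If you are not confident of the answer, reply exactly: "
                    "I don't know. Otherwise answer in at most three sentences."
                ),
            },
            # Trick question: the first Tour de France was held in 1903.
            {"role": "user", "content": "Who won the 1897 Tour de France?"},
        ],
    )
    print(resp.choices[0].message.content)  # ideally just: I don't know.

Even with instructions like this, abstention is not reliable, which is exactly the difficulty mrandish describes.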
▲ sva_ 2 hours ago

> Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

I think being better at this particular benchmark does not imply they're "smarter".
▲ criddell an hour ago

> The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

Maybe it's testing the wrong things then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at. I think the ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before, and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?
▲ woah 2 hours ago

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

Can you "prove" that GPT-2 isn't conscious?
▲ hmmmmmmmmmmmmmm 3 hours ago

I don't think the creator believes ARC-AGI-3 can't be solved, but rather that it can't be solved "efficiently", and at more than $13 per task, ARC-AGI-2 performance is certainly not efficient. But at this rate, the people who predict the goalposts will keep shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
▲ fishpham 5 hours ago
Yes, but benchmarks like this are often flawed because leading model labs frequently engage in "benchmarkmaxxing", so improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).
▲ layer8 4 hours ago

Isn't the point of ARC that you can't train against it? Or does it somehow no longer achieve that goal?
▲ egeozcan 3 hours ago

How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware, so any test, any benchmark, anything you do, leaks by definition. Considering human nature and the usual prisoner's dilemma, I don't see how the labs wouldn't focus on improving benchmark scores, even when it gets a bit... shady. I say this as a person who really enjoys AI, by the way.
▲ mrandish 13 minutes ago

> leaks by definition.

As a measure focused solely on fluid intelligence, novel-task learning, and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training. For example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first-order patterns that can be learned from one problem and reused to solve another. The non-profit ARC Prize Foundation keeps private versions of its tests which are never released and which only the foundation can administer. There are also public versions and semi-public sets so labs can run their own pre-tests. A lab self-testing on ARC-AGI is susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified, and that's a pretty big deal.

IMHO, ARC-AGI is different from every other AI benchmark in a significant way. It's worth spending a few minutes learning why: https://arcprize.org/arc-agi
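As a toy illustration of why per-task rules resist pre-training, here is a sketch in the style of the public ARC-AGI-1 JSON task format; the grids and the "mirror each row" rule are invented for this example:

    import json

    # An ARC task: a few "train" pairs demonstrating a novel transformation,
    # plus "test" pairs where that transformation must be applied. The rule
    # changes per task, so memorizing solved tasks doesn't transfer.
    task = json.loads("""
    {
      "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
      ],
      "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
      ]
    }
    """)

    def solve(grid):
        # This task's rule (mirror each row) has to be induced from the
        # train pairs alone, at test time.
        return [list(reversed(row)) for row in grid]

    assert all(solve(p["input"]) == p["output"] for p in task["train"])
    print(solve(task["test"][0]["input"]) == task["test"][0]["output"])  # True

Solving hundreds of such tasks means inducing hundreds of unrelated rules, which is why scores here are read as fluid intelligence rather than recall.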
▲ WarmWash 2 hours ago

Because the gains from spending time improving the model overall outweigh the gains from spending time training on individual benchmarks. The pelican benchmark is a good example: it has been representative of models' ability to generate SVGs generally, not just pelicans on bikes.
▲ theywillnvrknw 4 hours ago

* that you weren't supposed to be able to
▲ jstummbillig 4 hours ago

Could it also be that the models are just a lot better than a year ago?
▲ bigbadfeline 2 hours ago

> Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding. Since AI arrived, we've had higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified by that real benchmark. If Gemini 3 DT were better, we would see falling prices for electricity and everything else, at least until they return to pre-2019 levels.
▲ ctoth 2 hours ago

> If Gemini 3 DT were better, we would see falling prices for electricity and everything else, at least

Man, I've seen some maintenance folks down on the field working on them goalposts before, but I'm pretty sure this is the first time I saw aliens from another universe literally teleport in, grab the goalposts, and teleport out.
▲ WarmWash an hour ago

You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than in 2019 [2], which suggests that 2024 was more affordable than 2019. This is from the BLS consumer expenditure survey report released in December [1].

[1] https://www.bls.gov/news.release/cesan.nr0.htm
[2] https://www.bls.gov/opub/reports/consumer-expenditures/2019/

Prices are never going back to 2019 numbers, though.
▲ gowld an hour ago

That's an improper analysis. First, it's dollar-averaging every category, so it's not "% of income", which varies with each consumer unit's income. Second, I could commit to constant spending for my entire life (optionally inflation-adjusted, optionally as a % of income) by adjusting the quality of the goods and services I purchase. So total spending share is not a measure of affordability.
▲ WarmWash 17 minutes ago

Almost everyone's lifestyle ratchets up, so the handful who actually downgrade their living standard rather than increase spending would be tiny. This is part of a wider trend, too, where economic stats don't align with what people are saying, which is most likely explained by the economic anomaly of the pandemic skewing people's perceptions.
▲ XenophileJKO 4 hours ago

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3
▲ olalonde 4 hours ago

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
▲ gowld an hour ago

Does folding a protein count? How about increasing performance at Go?
▲ verdverm 5 hours ago
Here's a good thread spanning the past month or so, updated as each model has come out: https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...

tl;dr: Pekka says ARC-AGI-2 is now toast as a benchmark.
▲ Aperocky 4 hours ago

If you look at the problem space, it's easy to see why it's toast. Maybe there's intelligence in there, but it's hardly general.
▲ tasuki an hour ago

> Maybe there's intelligence in there, but it's hardly general.

Of course. Just as our human intelligence isn't general.
▲ verdverm 4 hours ago

The best way I've seen this described is "spikey" intelligence: really good at some points, and those points make the spikes. Humans are the same way; we each have a unique spike pattern of interests and talents. AIs, simplified, have effectively the same spikes across instances. I could argue that self-driving vs. chatbots vs. world models vs. game-playing might constitute enough variation. I would not say the same of Gemini vs. Claude vs. ... (instances); that's where I see "spikey clones".
▲ Aperocky 4 hours ago

You can get more spiky with AIs, whereas the human brain is more hard-wired. So maybe we are forced to be more balanced and general, whereas AI doesn't have to be.
▲ verdverm 4 hours ago

I suspect the non-spikey part is the more interesting comparison. Why is it so easy for me to open the car door, get in, close the door, and buckle up? You can do it in the dark and without looking. There are an infinite number of little things like this that you think nothing about and that take near-zero energy, yet which are extremely hard for AI.
▲ gowld an hour ago

You are asking a robotics question, not an AI question. Robotics is both more and less than AI. Boston Dynamics robots are getting quite near your benchmark.