Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation; it's pretty standard, not a brand-new thing.
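For context, SPL (Success weighted by Path Length) in the navigation literature averages, over episodes, a success flag times the ratio of the shortest path length to the path actually taken, clamped so the ratio never exceeds 1. A minimal sketch of that robotics formula follows. The function and argument names are illustrative, and this is not the ARC-AGI-3 formula itself, which is defined in the technical report.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest: Sequence[float],
        taken: Sequence[float]) -> float:
    """SPL as defined for navigation: mean of S_i * l_i / max(p_i, l_i).

    successes[i]: whether episode i reached the goal (S_i)
    shortest[i]:  shortest-path length for episode i (l_i)
    taken[i]:     path length the agent actually used (p_i)
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("need at least one episode")

    total = 0.0
    for solved, l, p in zip(successes, shortest, taken):
        if solved:
            # Efficiency is capped at 1: beating the reference earns no bonus.
            total += l / max(p, l)
        # Failed episodes contribute 0.
    return total / len(successes)


# Three episodes: optimal, twice as long as needed, and a failure.
example = spl([True, True, False], [10, 10, 10], [10, 20, 5])
```

A failed episode scores 0 no matter how few steps it used, which is what makes brute force unrewarding under this kind of metric.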

We tested ~500 humans over 90-minute sessions in SF, with a $115-$140 show-up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

Each game was seen by 10 people. Each was fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second-best action count, which is considerably worse than an optimal first play (even the #1 human action count is far from optimal). It is very achievable, and most people on this board would significantly outperform it.

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.
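To illustrate why an action cutoff changes little under an efficiency-based score, here is a sketch that assumes an SPL-style per-level score of human_actions / max(model_actions, human_actions). That form, the function name, and the treatment of over-budget runs as unsolved are assumptions for illustration, not the report's formula. Under this assumption, any run stopped by a 5X cutoff could have earned at most 1/5 on that level anyway.

```python
from typing import Optional


def level_efficiency(human_actions: int,
                     model_actions: Optional[int],
                     cutoff_multiple: float = 5.0) -> float:
    """Assumed SPL-style score for one level (hypothetical, for illustration).

    model_actions is None when the model never solved the level.
    Runs that exceed cutoff_multiple * human_actions count as unsolved.
    """
    if human_actions <= 0:
        raise ValueError("human_actions must be positive")

    budget = cutoff_multiple * human_actions
    if model_actions is None or model_actions > budget:
        return 0.0
    # Capped at 1.0: using fewer actions than the human baseline earns no bonus.
    return human_actions / max(model_actions, human_actions)


at_baseline = level_efficiency(100, 100)    # matches the human baseline
at_cutoff = level_efficiency(100, 500)      # solved exactly at the 5X budget
past_cutoff = level_efficiency(100, 501)    # stopped by the cutoff
```

The gap between `at_cutoff` and `past_cutoff` is the most the cutoff can cost on a level under this assumed form.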

> No harness at all and very simplistic prompt

This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

...

"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.

...

"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out these tools.

Imnimo 2 hours ago

Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.
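One way to estimate this would be to score every human attempt against the second-best action count and average the results. The sketch below assumes an SPL-style ratio (baseline / max(actions, baseline)) as a stand-in for the real metric, ignores level weighting and the action cutoff, and uses made-up action counts; the actual per-player data is not given in this thread.

```python
from typing import List, Optional, Sequence


def second_best(counts: Sequence[Optional[int]]) -> int:
    """Second-lowest action count among players who solved the game."""
    solved = sorted(c for c in counts if c is not None)
    if len(solved) < 2:
        raise ValueError("need at least two solvers to define the baseline")
    return solved[1]


def turk_expected_score(games: Sequence[Sequence[Optional[int]]]) -> float:
    """Expected score of an agent that replays a uniformly random human.

    games[g][i] is player i's action count on game g, or None if unsolved.
    Uses an assumed SPL-style ratio, not the official ARC-AGI-3 formula.
    """
    per_game: List[float] = []
    for counts in games:
        baseline = second_best(counts)
        scores = [
            0.0 if c is None else baseline / max(c, baseline)
            for c in counts
        ]
        per_game.append(sum(scores) / len(scores))
    return sum(per_game) / len(per_game)


# Made-up data: one game, four players, one of whom did not finish.
toy = turk_expected_score([[100, 200, 400, None]])
```

With real per-player action counts in place of the toy list, the same calculation would give the number asked for here, up to the differences between this assumed ratio and the official metric.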

causal 2 hours ago

Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

fchollet 2 hours ago

I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

causal 2 hours ago

Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.

adgjlsfhk1 an hour ago

The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human-centric world.

	causal 40 minutes ago
		Then why deny it a harness it can also use in a human centric world?

fc417fc802 2 hours ago

The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.

	causal an hour ago
		The issue is that ARC AGI 3 specifically forbids harnesses that humans get to use.

blueblisters 2 hours ago

I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

strongpigeon 2 hours ago

Something that I don't understand after reading the technical report is: Why is having access to a Python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?

	cdetrio 14 minutes ago
		The Duke harness was specifically designed for these puzzles; that's why they don't want to measure it. My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox") is that there's no way to prevent it. But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not ARC-AGI-specific. In that case, the models should be benchmarked by prompting through Claude Code and Codex, rather than through the API (as from the API we only expect raw LLM output, and no tool use).

WarmWash 3 hours ago

Maybe this is a "can neither confirm nor deny" thing, but are there systems in place or design decisions meant to surface attempts at benchmark optimizing (benchmaxxing), beyond just having private sets? Something like a heuristic anti-cheat, I suppose.

Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

	fchollet 2 hours ago
		There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

cdetrio an hour ago

Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. Claude Code, or Codex, or even the web front-ends)?

GodelNumbering 2 hours ago

Off topic but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a read.