| ▲ | fchollet 3 hours ago | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing. We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers. Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it. Try the games yourself if you want to get a sense of the difficulty. > Models can't use more than 5X the steps that a human used These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive. > No harness at all and very simplistic prompt This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible." ... "We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems. ... "Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools." If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | Imnimo 2 hours ago | parent | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | causal 2 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not. I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | blueblisters 2 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made. Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | strongpigeon 2 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | WarmWash 3 hours ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Maybe this is a neither can confirm or deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimizing (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat I suppose. Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | cdetrio an hour ago | parent | prev | next [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | GodelNumbering 2 hours ago | parent | prev [-] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Off topic but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a read. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||