nubg 4 hours ago

Any benchmarks?

gordonhart 3 hours ago | parent [-]

The main frontier models are all up on https://arcprize.org/tasks

Barely any of them break 0% on the demo tasks: Claude Opus 4.6 comes out on top with a few <3% scores, Gemini 3.1 Pro gets two nonzero scores, and the others (GPT-5.4 and Grok 4.20) score 0% across the board.

ACCount37 3 hours ago | parent | next [-]

Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.

gordonhart 3 hours ago | parent | next [-]

My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting, while Opus 4.6 is the best at path planning. This task leans heavily on both but maybe a little more towards planning, so I'm not too shocked that Opus is narrowly on top.

When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
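A minimal sketch of what that rendering step in a harness could look like, assuming the grid arrives as a JSON array of small integer cell values (the 0–9 color palette here is illustrative, not the official ARC one); it emits a binary PPM image that could then be passed to a vision-capable model:

```python
import json

# Illustrative RGB palette for cell values 0-9 (not the official ARC colors).
PALETTE = [
    (0, 0, 0), (0, 116, 217), (255, 65, 54), (46, 204, 64), (255, 220, 0),
    (170, 170, 170), (240, 18, 190), (255, 133, 27), (127, 219, 255), (135, 12, 37),
]

def grid_to_ppm(grid_json: str, scale: int = 20) -> bytes:
    """Render a JSON grid of cell values as a binary PPM (P6) image."""
    grid = json.loads(grid_json)
    h, w = len(grid), len(grid[0])
    header = f"P6 {w * scale} {h * scale} 255\n".encode()
    rows = []
    for row in grid:
        # Repeat each cell's 3-byte color `scale` times horizontally...
        line = b"".join(bytes(PALETTE[cell]) * scale for cell in row)
        # ...and repeat the whole scanline `scale` times vertically,
        # so cell boundaries stay crisp (nearest-neighbor upscaling).
        rows.append(line * scale)
    return header + b"".join(rows)

# Example: a 2x2 grid rendered at 20px per cell.
ppm = grid_to_ppm(json.dumps([[0, 1], [2, 3]]))
```

The resulting bytes can be written straight to a `.ppm` file or converted to PNG before being attached to a multimodal prompt.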

culi an hour ago | parent | prev [-]

Given the drastic difference in price, I think the chart actually shows Gemini 3.1 in the best light. Google DeepMind's results are in basically the same range, and they're presumably willing to spend as much on compute as Anthropic is to hit those benchmark numbers.

thatguymike 2 hours ago | parent | prev [-]

Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard

gordonhart 37 minutes ago | parent [-]

The individual task scores are all on public tasks; they still hold out a hundred or so private tasks, which presumably GPT-5.4 did well on to get its leaderboard position.