lairv 8 hours ago

Note that this uses a harness so it doesn't qualify for the official ARC-AGI-3 leaderboard

According to the authors the harness isn't ARC-AGI specific though https://x.com/agenticasdk/status/2037335806264971461

fchollet 5 hours ago | parent | next [-]

It is 100% ARC-AGI-3 specific though, just read through the prompts https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

boxed 3 hours ago | parent | next [-]

What a dick move. Making that prompt open source will probably mean that every other model maker that doesn't want to cheat will scrape it and accidentally cheat in their next models.

diwank 3 hours ago | parent | prev | next [-]

this is so disingenuous on symbolica's part. these insincere announcements just make it harder for genuine attempts and novel ideas

DetroitThrow 4 hours ago | parent | prev [-]

Um, yes, this is extremely specific as a benchmark harness. It has a ton of knowledge encoded about the tasks at hand. The tweet is dishonest even in the best light.

The hard part of these tests isn't purely reasoning ability ffs.

krackers 6 hours ago | parent | prev | next [-]

> this uses a harness

This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

fermentation 4 hours ago | parent | next [-]

Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.

UltraSane 3 hours ago | parent | prev [-]

It isn't arbitrary. They want to measure the capability of the general LLM.

fc417fc802 19 minutes ago | parent [-]

So if I say "I want to measure your capability as a mechanic" but then also "to ensure an accurate score you're forbidden to use any tools", how are you, the human mechanic, planning to diagnose and fix the engine problem without wrenches and jack stands and the like? It makes no sense.

That said, their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and, above all, pointless hoop jumping, but that doesn't make the linked "achievement" any less fraudulent.

osti 6 hours ago | parent | prev | next [-]

Don't the chat versions of ChatGPT and Gemini also have interleaved tool calls? Do those also count as using harnesses?

WiSaGaN 3 hours ago | parent [-]

A harness is fine. I think people here are arguing that what was provided here to take the test is not a harness.

mmaunder 5 hours ago | parent | prev | next [-]

We're calling agents harnesses now?

fritzo 4 hours ago | parent | next [-]

ELI5 what is a harness?

EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:

> We seek to fight two forms of overfitting that would muddy public sensefinding:

> Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.

boxed 3 hours ago | parent | prev | next [-]

The point of this test is to check if an AI system can figure out the game. This isn't what happened here. A human figured out the game, wrote in their prompts exactly how the game works and THEN put the AI on the problem. This is 100% cheating and imo quite stupid.

lwansbrough 4 hours ago | parent | prev [-]

I think people generally regard a harness as the system instructions + tools made available to the LLM (and probably the thing that runs the LLM conversation in a loop). An agent is, collectively, the LLM plus the harness.
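
To make that definition concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `Harness` class, the `CALL <tool> <argument>` convention, and `stub_model`, which stands in for a real LLM call. It is not how any particular vendor or the ARC-AGI-3 submission implements things; it only shows the three pieces (system instructions, tools, loop) in one place.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; not any vendor's API.
ToolFn = Callable[[str], str]
ModelFn = Callable[[list[str]], str]


@dataclass
class Harness:
    system_prompt: str                 # the system instructions
    tools: dict[str, ToolFn]           # tools made available to the model
    max_steps: int = 5                 # bound on loop iterations
    transcript: list[str] = field(default_factory=list)

    def run(self, model: ModelFn, task: str) -> str:
        """Run the conversation loop until the model stops calling tools."""
        self.transcript = [f"system: {self.system_prompt}", f"user: {task}"]
        for _ in range(self.max_steps):
            reply = model(self.transcript)
            self.transcript.append(f"assistant: {reply}")
            if not reply.startswith("CALL "):
                return reply  # plain reply: treat it as the final answer
            # Assumed convention: "CALL <tool> <argument>"
            parts = reply.split(" ", 2)
            if len(parts) != 3:
                result = "malformed tool call"
            elif parts[1] not in self.tools:
                result = f"unknown tool {parts[1]}"
            else:
                result = self.tools[parts[1]](parts[2])
            self.transcript.append(f"tool: {result}")
        return "step limit reached"


def stub_model(transcript: list[str]) -> str:
    """Stand-in for an LLM: call one tool, then answer with its result."""
    last = transcript[-1]
    if last.startswith("tool: "):
        return "answer: " + last[len("tool: "):]
    return "CALL upper hello"


harness = Harness(system_prompt="Use tools when helpful.", tools={"upper": str.upper})
agent_answer = harness.run(stub_model, "shout hello")
```

In this framing the "agent" is `stub_model` plus `harness` together. The dispute in this thread is about what goes into `system_prompt`, not about whether a loop and tools exist.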

falcor84 7 hours ago | parent | prev [-]

I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.

sanxiyn 7 hours ago | parent | next [-]

There is. The official leaderboard is without a harness, and the community leaderboard is with a harness. Read the ARC-AGI-3 technical paper for details.

falcor84 7 hours ago | parent [-]

I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.

Anyway, searching both in ARC-AGI's paper and website and directly on kaggle, I failed to find a with-harness leaderboard; can you please give the link?

sanxiyn 7 hours ago | parent [-]

Here it is: https://arcprize.org/leaderboard/community

steve_adams_86 6 hours ago | parent | prev [-]

I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
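
One small illustration of the safety and determinism angle, as a sketch only: the harness can check each action the model proposes against fixed rules before anything runs. The action names, the dict shape, and `validate_action` are all made up for this example and assume the model emits a structured action.

```python
# Hypothetical allowlist; a real harness would define its own action set.
ALLOWED_ACTIONS = {"read_file", "list_dir"}


def validate_action(proposed: dict) -> tuple[bool, str]:
    """Deterministically accept or reject a model-proposed action."""
    name = proposed.get("action")
    if name not in ALLOWED_ACTIONS:
        return False, f"rejected: {name!r} is not an allowed action"
    path = proposed.get("path", "")
    # Only relative paths with no parent-directory hops are accepted.
    if not isinstance(path, str) or path.startswith("/") or ".." in path.split("/"):
        return False, "rejected: path must be relative and stay inside the workspace"
    return True, "ok"
```

The model can still propose anything it likes, but whether the proposal runs is decided by ordinary code that behaves the same way every time.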

j_bum 4 hours ago | parent [-]

Could you point me to some resources to learn about harnesses? I’d love to hear an example of a use case you’re thinking of.