hmmmmmmmmmmmmmm 4 hours ago

But it doesn't, except on certain benchmarks that likely involve overfitting. Open source models are nowhere to be seen on ARC-AGI. Nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328

meffmadd 3 hours ago | parent | next [-]

Have you ever used an open model for a bit? I am not saying they are not benchmaxxing but they really do work well and are only getting better.

Aurornis 2 hours ago | parent [-]

I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.

Zababa 3 hours ago | parent | prev | next [-]

Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?

doodlesdev 3 hours ago | parent | prev [-]

GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the qualities that most people assess when using LLMs.

nananana9 44 minutes ago | parent | next [-]

It was terrible at a lot of things. It was beloved because when you say "I think I'm the reincarnation of Jesus Christ" it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"

mrybczyn an hour ago | parent | prev [-]

That's because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...