▲ meffmadd 3 hours ago
Have you ever used an open model for a bit? I am not saying they are not benchmaxxing, but they really do work well and are only getting better.
▲ Aurornis 2 hours ago
I have used a lot of them. They're impressive for open weights, but the benchmaxxing becomes obvious. They don't compare to the frontier models (yet), even when the benchmarks show them coming close.
|
|
▲ Zababa 3 hours ago
Has the difference between performance on "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? If a model is great on regular benchmarks but terrible at ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's just not ARC-AGI benchmaxxed"?
|
▲ doodlesdev 3 hours ago
GPT-4o was also terrible at ARC-AGI, yet it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe they correspond directly to the qualities most people assess when using LLMs.
▲ nananana9 an hour ago
It was terrible at a lot of things. It was beloved because when you said "I think I'm the reincarnation of Jesus Christ," it would tell you "You know what... I think I believe it! I genuinely think you're the kind of person who appears once every few millennia to reshape the world!"
▲ mrybczyn an hour ago
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
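
To make that "2D territory" concrete: ARC-AGI tasks are small colored-grid puzzles where the solver must infer a transformation rule from a few input/output demonstration pairs and apply it to a held-out input. A minimal sketch in Python follows; the task data and the mirror_horizontal rule are invented for illustration (real tasks ship as JSON grids of integer colors 0-9 with "train" and "test" pairs), not an actual ARC task.

    # Minimal sketch of an ARC-AGI-style task. Grids are lists of rows,
    # each cell an integer color 0-9. Task data here is hypothetical.
    Grid = list[list[int]]

    task = {
        "train": [
            {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
            {"input": [[5, 5, 0]], "output": [[0, 5, 5]]},
        ],
        "test": [{"input": [[7, 0, 4]]}],
    }

    def mirror_horizontal(grid: Grid) -> Grid:
        # The rule a solver would have to infer from the train pairs
        # alone: every row is reversed left-to-right.
        return [list(reversed(row)) for row in grid]

    # Check the candidate rule against the demonstration pairs...
    assert all(
        mirror_horizontal(pair["input"]) == pair["output"]
        for pair in task["train"]
    )
    # ...then apply it to the held-out test input.
    print(mirror_horizontal(task["test"][0]["input"]))  # [[4, 0, 7]]

The point of the benchmark is that the rule is different for every task and (ideally) absent from pretraining data, so pattern-matching over memorized text doesn't help.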
|