Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.

▲

maxall4 4 hours ago | parent | next [-]

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

▲

energy123 3 hours ago | parent | prev | next [-]

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

▲

tasuki 2 hours ago | parent | next [-]

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

▲

CamperBob2 3 hours ago | parent | prev [-]

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

▲

layer8 2 hours ago | parent [-]

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

▲

blinding-streak 4 hours ago | parent | prev | next [-]

I assume all the frontier models are benchmaxxing, so it would make sense.

▲

boplicity 4 hours ago | parent | prev [-]

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.