ripbozo 4 hours ago
Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.
maxall4 4 hours ago
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvements on other benchmarks are not of the same order.
energy123 3 hours ago
Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.
blinding-streak 4 hours ago
I assume all the frontier models are benchmaxxing, so it would make sense.
boplicity 4 hours ago
Couldn't benchmark maxing be interpreted as benchmarks actually serving as a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.