osti 2 days ago

But is ARC-AGI really that useful? Nowadays it seems to me that it's just another benchmark that needs to be specifically trained for. Maybe the Chinese models just didn't focus on it as much.

sdenton4 2 days ago | parent | next [-]

Doing great on public datasets and underperforming on private benchmarks is not a good look.

Deegy 2 days ago | parent [-]

Is it though? Do we still have the expectation that LLMs will eventually be able to solve problems they haven't seen before? Or do we just want the most accurate auto complete at the cheapest price at this point?

sdenton4 a day ago | parent [-]

It indicates that there's a good chance that they have trained on the test set, making the eval scores useless. Even if you have given up on the dream of generalization entirely, you can't meaningfully compare models which have trained on test to those which have not.
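The gap sdenton4 describes can be made concrete: if a model's score collapses when moving from a public benchmark to a private held-out one, that is a red flag for test-set contamination. A minimal sketch of that comparison (all model names and scores below are hypothetical illustrations, not real results):

```python
# Sketch: flag models whose public-benchmark scores collapse on a private
# (held-out) benchmark -- one common signal of training on the test set.
# Model names and scores are made up for illustration.

def contamination_signal(public_score: float, private_score: float) -> float:
    """Relative drop from the public to the private benchmark (0.0 = no drop)."""
    if public_score <= 0:
        raise ValueError("public_score must be positive")
    return (public_score - private_score) / public_score

models = {
    "model_a": (0.85, 0.80),  # small drop: scores plausibly generalize
    "model_b": (0.90, 0.45),  # large drop: possible test-set contamination
}

for name, (pub, priv) in models.items():
    drop = contamination_signal(pub, priv)
    flag = "suspect" if drop > 0.25 else "ok"
    print(f"{name}: relative drop = {drop:.2f} ({flag})")
```

A large relative drop doesn't prove contamination (the private set may just be harder), but it does mean the public score can't be compared at face value against models that were evaluated cleanly.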

stavros a day ago | parent | prev [-]

You're not supposed to train for benchmarks, that's their entire point.