kingstnap 4 hours ago
		It's possibly label noise, but you can't tell from a single number. You would need to check whether everyone is making mistakes on the same 20% of questions or on a different 20%. If it's the same 20%, either those questions are really hard, they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem. It happens. The old, non-Pro MMLU had a lot of wrong answers. Even simple datasets like MNIST have digits that are labeled incorrectly or drawn so badly that they aren't even digits anymore.
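A rough sketch of that check, assuming you can get the set of missed question IDs for each model. The model names and question IDs below are made up for illustration, and mean pairwise Jaccard overlap is just one reasonable way to measure "same 20% vs. different 20%":

```python
from itertools import combinations


def error_overlap(errors_by_model):
    """Mean pairwise Jaccard overlap between the models' sets of missed questions.

    1.0 means every model misses exactly the same questions;
    0.0 means no two models share a single missed question.
    """
    scores = []
    for a, b in combinations(errors_by_model.values(), 2):
        union = a | b
        # Two models with no errors at all are treated as fully agreeing.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def shared_errors(errors_by_model):
    """Questions that every model gets wrong (candidates for label noise)."""
    return set.intersection(*errors_by_model.values())


# Hypothetical data: 10 questions, each model misses 2 of them (20%).
same = {"model_a": {3, 7}, "model_b": {3, 7}, "model_c": {3, 7}}
different = {"model_a": {0, 1}, "model_b": {4, 5}, "model_c": {8, 9}}

print(error_overlap(same), shared_errors(same))
print(error_overlap(different), shared_errors(different))
```

High overlap only narrows things down: the shared questions still have to be read by hand to tell "really hard" apart from "keyed incorrectly" or "underspecified".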
	kenjackson 5 hours ago
		Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

saberience 5 hours ago

It's a useless, meaningless benchmark, though. It just got a catchy name, implying that if the models solve it they have "AGI", which is clearly rubbish.

The ARC-AGI score isn't correlated with anything useful.

Legend2440 3 hours ago

It's correlated with the ability to solve logic puzzles.

It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.

HDThoreaun 22 minutes ago

ARC-AGI 2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful.

jabedude 5 hours ago

How would we actually objectively measure a model to see if it is AGI, if not with benchmarks like ARC-AGI?

	WarmWash 4 hours ago
		Give it a prompt like

		>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home

		And get back an automatic coupon code app, like the user actually wanted.