| ▲ | theLiminator 3 hours ago |
| Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time. |
|
| ▲ | pptr 2 minutes ago | parent | next [-] |
| Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery.
I would prefer the human. "Steps" are important to optimize if they have negative externalities. |
|
| ▲ | ACCount37 3 hours ago | parent | prev | next [-] |
| It's kind of the point? To test AI where it's weak instead of where it's strong. "Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really. |
| |
| ▲ | famouswaffles 2 hours ago | parent | next [-] | | ARC has always had that problem, but for this round the score is just too convoluted to be meaningful. I want to know how well the models can solve the problems. I may also want to know how 'efficient' they are, but I don't really care as long as they're solving them in reasonable clock time and/or cost. I certainly do not want those jumbled into one messy, convoluted score. 'Reasoning steps' here is arbitrary and meaningless. Not only does it lack the utility of the above two, it's just incredibly silly to me to directly compare something like that between entities operating on wildly different substrates. If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of the problems' to 'solving everything correctly but with more reasoning steps than the best human scores' — wildly different implications. What use is a score like that? | | |
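To make the ambiguity concrete, here's a toy sketch of an efficiency-weighted score. This is purely illustrative and is NOT the benchmark's actual formula — the `efficiency_score` function and the step counts are made up for the example. The point is that two wildly different model profiles can land on the exact same number:

```python
# Hypothetical efficiency-weighted score (NOT the real ARC formula):
# each task contributes solved * min(1, human_steps / model_steps),
# and the final score is the average over all tasks.

def efficiency_score(results):
    """results: list of (solved, human_steps, model_steps) tuples."""
    per_task = [
        (1.0 if solved else 0.0) * min(1.0, human / model)
        for solved, human, model in results
    ]
    return sum(per_task) / len(per_task)

# Profile A: solves only 2 of 8 tasks, but matches human step counts.
profile_a = [(True, 16, 16)] * 2 + [(False, 16, 16)] * 6

# Profile B: solves all 8 tasks, but takes 4x the human's steps.
profile_b = [(True, 16, 64)] * 8

print(efficiency_score(profile_a))  # 0.25
print(efficiency_score(profile_b))  # 0.25
```

Both profiles score 0.25 even though one barely solves anything and the other solves everything, which is exactly the "5% could mean anything" complaint.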
| ▲ | pants2 an hour ago | parent | next [-] | | The measurement metric is in-game steps; unlimited reasoning between steps is fine. This makes sense to me. Most actions have some cost associated, and as another poster stated, it's not interesting to let models brute-force a solution with millions of steps. | | |
| ▲ | famouswaffles an hour ago | parent [-] | | Same thing in this case: no utility, and just as arbitrary. None of the issues with the score change. Models do not brute-force solutions in that manner — if they did, we'd wait the lifetimes of several universes before we could expect a significant result. Regardless, since there's a 5x step cutoff, 'brute-forcing with millions of steps' was never on the table. |
| |
| ▲ | thereitgoes456 9 minutes ago | parent | prev [-] | | The metric is very similar to cost. It seems odd to justify one and not the other. |
| |
| ▲ | jstummbillig 2 hours ago | parent | prev [-] | | It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious). |
|
|
| ▲ | cyanydeez 3 hours ago | parent | prev [-] |
| I think your logic isn't sound: wouldn't we want an "intelligence" to solve problems efficiently rather than brute-force them with a million monkeys? There's definitely a limit to compute, the same way there's a limit to how much oil we can use, etc. In theory, sure, if I can throw a million monkeys and rambling at a problem until a solution falls out, it doesn't matter how I got there. In practice, though, every attempt has a direct and indirect impact via its externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise. Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve. |
| |
| ▲ | diego_sandoval 2 hours ago | parent [-] | | > Lastly, humans use way less energy to solve these in fewer steps, Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem. | | |
| ▲ | cyanydeez an hour ago | parent [-] | | Ok, but that's the same for building a data center. Try again. | | |
| ▲ | fsdf2 a minute ago | parent | next [-] | | Oh, and who provided the 'food' for the models? ... People who write stuff like the poster above you are bizarro. Absolutely bizarro. Did the LLM manifest itself into existence? Wtf. | |
| ▲ | gunalx 34 minutes ago | parent | prev [-] | | Yes, especially when you consider that a datacenter needed the energy of a great many people to be built. A single human is indeed more efficient, way more flexible, and actually just general intelligence. |
|
|
|