▲ ACCount37 3 hours ago
It's kind of the point? To test AI where it's weak instead of where it's strong. "Sample-efficient rule inference where the AI gets to control the sampling" seems like a good capability to have. It would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.
▲ famouswaffles 2 hours ago
ARC has always had that problem, but for this round the score is just too convoluted to be meaningful. I want to know how well the models can solve the problems. I may want to know how 'efficient' they are, but really I don't care, as long as they're solving them in reasonable clock time and/or at reasonable cost. I certainly do not want all of that jumbled into one messy, convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only does it have no utility, unlike the above two, but it's incredibly silly to me to think we should directly compare something like that across entities operating in wildly different substrates. If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to 'solving everything correctly, but with more reasoning steps than the best human scores.' Those are wildly different implications. What use is a score like that?
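To make the ambiguity concrete, here's a toy sketch in Python. The composite_score formula is invented purely for illustration (it is not ARC's actual metric); the point is just that two very different outcomes can collapse to the same number once you blend solve rate with a step-efficiency penalty:

    # Toy composite metric (hypothetical, not ARC's real formula):
    # solve rate discounted by how far "reasoning steps" exceed a budget.
    def composite_score(solve_rate: float, steps_used: float, steps_budget: float) -> float:
        efficiency = min(1.0, steps_budget / steps_used)
        return solve_rate * efficiency

    # Scenario A: solves only 5% of tasks, but stays within the step budget.
    a = composite_score(solve_rate=0.05, steps_used=100, steps_budget=100)

    # Scenario B: solves every task, but uses 20x the step budget.
    b = composite_score(solve_rate=1.00, steps_used=2000, steps_budget=100)

    print(a, b)  # both print 0.05: the blended score can't tell them apart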
▲ jstummbillig 2 hours ago
It's an interesting point, but I too find it questionable. Humans operate differently from machines. We don't design CPU benchmarks around how humans would approach a given computation, so it's not entirely obvious why we would do it here (though it might still be a good idea; I'm curious).