Remix.run Logo
Imnimo 2 hours ago

Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.