causal 2 hours ago
Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not. I'm guessing you did not pass the human testers JSON blobs to work with, and I suspect they would also score 0% without eyesight and the visual cortex as a harness for their reasoning ability.
fchollet 2 hours ago
I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-) (This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)
fc417fc802 2 hours ago
The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue. I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.