sigmoid10 4 days ago

No, you misunderstood. I'm saying that for reasoning models, there is a lot of untapped capability on this test. I wouldn't be so sure there are hard limits: given enough compute, you'll probably find that a modern high-end model reaches 100%. But you probably don't want to spend thousands (or perhaps tens of thousands) of dollars on that. There are much better tests out there if you have money to burn and want to find true hard limits compared to humans.