falcor84 8 hours ago
They focus on minimizing the number of moves and don't allow any harness whatsoever, putting the bar extremely high. The current top verified contender (Claude Opus 4.6) is at only 0.45%. But with how new it is, I expect a lot of improvement in the next generation of models.
threepts 7 hours ago
Optimal for judging actual reasoning ability rather than an LLM's ability to regurgitate knowledge from a necropost on HN/Reddit/Twitter from 2018.