The leaderboard sorting is very misleading, gpt-5.5-pro only found 2 while mimo-v2.5-pro found 4.5 out of 9 cases.