The test is rigged because they used non-thinking models.
These are reasoning/thinking models.
Source?