For one, they aren't using the latest version of many of the benchmarks, e.g., ARC-AGI 2 rather than 3.