How much of that is because the models are optimizing specifically for SWE-bench?
Not that much, because they're getting better at all benchmarks.