osti 5 hours ago
Somehow regresses on SWE-bench?
lkbm 5 hours ago
I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.
SubiculumCode 5 hours ago
That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
usaar333 5 hours ago
I'd interpret that as rounding error, i.e. unchanged. SWE-bench seems really hard once you are above 80%.
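
For anyone who wants to sanity-check the "0.1% is noise" reading, here is a rough back-of-the-envelope sketch. It treats each task as an independent pass/fail trial and computes the binomial standard error of the pass rate. The task count (500) and pass rate (80%) are illustrative assumptions, not figures from the announcement, and `pass_rate_standard_error` is a made-up helper name. Real benchmark tasks are not truly independent, so take this as an order-of-magnitude estimate only.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks
    independent pass/fail tasks (a simplifying assumption)."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# Assumed, illustrative numbers: 500 tasks, 80% pass rate.
se = pass_rate_standard_error(0.80, 500)
print(f"standard error: {se * 100:.2f} percentage points")
```

Under those assumptions a single run has a standard error of well over one percentage point, so a 0.1-point gap is far smaller than run-to-run variation unless scores are averaged over many runs.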