| ▲ | megavon 2 days ago | |
Need to look at SWEBench-Pro; it's super competitive. Suspect they'll catch up, given the longer tail on TB scores. | ||
| ▲ | jaen 2 days ago | |
Judging by the (lack of) inter-model variance, I don't think SWEBench-Pro does a very good job of representing model capability. Terminal-Bench seems more challenging and separates the wheat from the chaff. Also, *ops work, which in my experience can actually be more complicated than SWE, is obviously underrepresented there. | ||