DCKing 4 hours ago
Any benchmark is iffy and has weird results, but this is the best we've got at the moment. Most people working with Opus and Kimi would likely tell you they're much further apart than the numbers quoted for Kimi K2.6 suggest, and DeepSWE seems to capture that gap better. One major thing DeepSWE has going for it that all the other benchmarks (including those quoted by MoonshotAI on this page) don't: the other benchmarks are completely gamed. Their answers are public and part of each model's training data. This benchmark may still be iffy, but at least it's not gamed.
WarmWash 3 hours ago
Somehow the internet has also forgotten that cheating to get ahead in China is basically the norm and expected behavior.