| ▲ | lordmauve 5 hours ago |
| Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer. Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves. |
|
| ▲ | phainopepla2 4 hours ago | parent [-] |
| I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions. |
| |
| ▲ | lordmauve 3 hours ago | parent | next [-] |
| I don't know if DeepSWE is genuinely a good benchmark. It's more important that their analysis demolished the validity of SWE-Bench Pro, objectively: it is being mismarked. I think that buys enough credibility to propose an alternative. I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture. |
| ▲ | sourcecodeplz 3 hours ago | parent | prev | next [-] |
| It's the extra-high thinking: on artificialanalysis.ai it uses 240M tokens vs. 40M for GPT-5.4/5. Not worth it, even at the low price. |
| ▲ | gck1 2 hours ago | parent | prev [-] |
| Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering on unusable. |
|