avs733 20 hours ago
3% to 40% is a roughly 13x improvement. 40% to 80% is a 2x improvement. It’s not that the second leap isn’t impressive; it just doesn’t change your perspective on reality in the same way.
viraptor 20 hours ago
Maybe... It will be interesting to see how the improvements now carry over to other benchmarks. Is 80% to 90% going to be an incremental fix with minimal impact on the next benchmark (same work, but better), or is it going to be an overall 2x improvement on the remaining unsolved cases (a different approach tackling previously missed areas)? It really depends on how that remaining improvement happens. We'll see it soon though: every benchmark nearing 90% is being replaced with something new. SWE-verified is almost dead now.
energy123 20 hours ago
80% to 100% would be an even smaller improvement, but arguably the most impressive and useful (assuming the benchmark isn't in the training data).
andyferris 20 hours ago
I wouldn’t want to wait ages for Claude Code to fail 60% of the time. A 20% risk seems more manageable, and the improvements speak to better code and problem-solving skills all around.
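
The two ways of counting used in this thread (how many times higher the pass rate is, versus how many times lower the failure rate is) can be put side by side. This is only a rough sketch of the arithmetic; the function names `success_ratio` and `failure_ratio` are made up for illustration, and the scores are the ones quoted in the comments above, not measured results.

```python
def success_ratio(before: float, after: float) -> float:
    """How many times higher the pass rate is after the change."""
    return after / before


def failure_ratio(before: float, after: float) -> float:
    """How many times lower the failure rate is after the change."""
    fail_after = 1.0 - after
    if fail_after == 0:
        # No failures left: the reduction is unbounded.
        return float("inf")
    return (1.0 - before) / fail_after


# Score jumps mentioned in the thread, as fractions of tasks solved.
jumps = [(0.03, 0.40), (0.40, 0.80), (0.80, 0.90), (0.80, 1.00)]

for before, after in jumps:
    print(
        f"{before:.0%} -> {after:.0%}: "
        f"pass rate x{success_ratio(before, after):.2f}, "
        f"failures /{failure_ratio(before, after):.2f}"
    )
```

Read this way, 3% to 40% is about 13.3x on pass rate but only about 1.6x fewer failures, while 40% to 80% is 2x on pass rate and 3x fewer failures. That is why the second jump can feel smaller on the headline number and still matter more to someone waiting on each run.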