jkelleyrtp 4 hours ago

Claude's SWE-bench score is 80.8 and Codex's is 56.8

Seems like 4.6 is still all-around better?

gizmodo59 4 hours ago

It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.

joshuahedlund 4 hours ago

Any ideas why Verified has stagnated? It was increasing rapidly and then basically stopped.

Snuggly73 4 hours ago

It has been pretty much a memorization benchmark for a while. There is a paper on the subject somewhere.

SWE-bench Pro public is newer, but it's not live, so it will slowly get memorized as well. The private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

Rudybega an hour ago

You're comparing two different benchmarks. Pro vs Verified.