tedsanders 21 hours ago
> SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors

SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences. Here's a nice thread on X about the things that SWE-bench doesn't measure:
dwaltrip 20 hours ago
so annoying you can't read replies without an account nowadays