Eridrus 7 hours ago
The problem is that there are a bunch of benchmarks, the model providers often don't even use the same ones, a bunch of them have known problems, and it's expensive to run your own. I am a GPT 5.x booster since to me it just feels smarter, and I generally felt like the benchmarks backed me up, but not every benchmark does, so sadly we're mostly arguing about vibes. SWEBench-Pro was a big one, though apparently Claude was reading solutions out of the .git folder it wasn't meant to have access to, among other problems.
smoe 7 hours ago
I find it fascinating that every time this kind of discussion comes up, people talk about night and day experiences between Claude and Codex, in both directions. I’m really wondering what people are doing to get such different outcomes. I’m currently working on two projects/clients, one using Claude, one using Codex. I have a strong preference for the latter, but not because I think it is much more intelligent or writes much better code. It is simply because I find the way of interacting with it more pleasant: in my experience it is more literal and mechanical, makes fewer assumptions and/or double-checks, and is less proactive. At least that was the case until some updates over the last few weeks.