The problem is that there are a bunch of benchmarks, the model providers often don't even use the same benchmarks, a bunch of them have known problems, and it's expensive to do your own benchmarks.

I am a GPT 5.x booster because to me it just feels smarter, and I generally felt like the benchmarks backed me up, but not every benchmark does, so sadly we're mostly arguing about vibes.

SWEBench-Pro was a big one, though apparently it had problems too; among others, Claude was reading solutions out of the .git folder it wasn't meant to have access to.

▲ smoe 7 hours ago | parent [-]

I find it fascinating that every time this kind of discussion comes up, people talk about night and day experiences between Claude and Codex, in both directions. I’m really wondering what people are doing to get such different outcomes.

I’m currently working on two projects/clients, one using Claude, one using Codex. I have a strong preference for the latter, but not because I think it is much more intelligent or writes much better code. It is simply because I find the way of interacting with it more pleasant: in my experience it is more literal and mechanical, makes fewer assumptions (or double-checks them), and is less proactive. At least that was the case until some updates over the last few weeks.

	▲	Eridrus 5 hours ago | parent | next [-]
		I think I like Codex for the same reason tbh. I think it's just general misanthropy or autism or something lol. Most people seem to prefer Claude. For me, Codex was visibly smarter than Claude until 4.8 came out; it would regularly do better debugging and IMO write better code. I think 4.8 is close. Claude is widely regarded to have a big lead in front-end, which I do not work on. Claude's Ultrathink is pretty cool, though obviously it eats up tokens like nothing else.
	▲	AlphaSite 4 hours ago | parent | prev [-]
		It probably means they’re close enough that there’s no observable difference. Or each is better at very different things.