| |
| ▲ | tedsanders 4 hours ago | parent | next [-] | | We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though. (I'm from OpenAI.) | | |
| ▲ | derwiki a minute ago | parent | next [-] | | Has this always been the case? | |
| ▲ | wasmainiac an hour ago | parent | prev | next [-] | | Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than Google in most cases. Often it is on point, but at times it seems off, inaccurate or shallow maybe. In some cases I just end the session. | | |
| ▲ | nl an hour ago | parent [-] | | Usually I find this kind of variation is due to context management. Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else's, but it's still an issue. If you are seeing this kind of thing, start a new chat and re-run the same query. You'll usually see an improvement. |
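The "start a new chat" advice can be sketched concretely. A minimal illustration, assuming the common chat-API message shape (a list of role/content dicts); the helper name `fresh_context` is mine, not from any SDK:

```python
# Minimal sketch: instead of carrying the full accumulated history into
# every request, keep only the system prompt plus the latest user turn.
# This mimics starting a new chat and re-running the same query.

def fresh_context(history):
    """Drop accumulated turns, keeping the system prompt and last user message."""
    system = [m for m in history if m["role"] == "system"]
    last_user = [m for m in history if m["role"] == "user"][-1:]
    return system + last_user

history = [
    {"role": "system", "content": "You are a docs assistant."},
    {"role": "user", "content": "How do I configure X?"},
    {"role": "assistant", "content": "...long answer..."},
    {"role": "user", "content": "Now explain Y."},
]
trimmed = fresh_context(history)
# trimmed holds just the system prompt and the final question
```

Real compaction is more sophisticated (it summarizes the dropped turns rather than discarding them), but the effect on accuracy at large context sizes is the same idea.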
| |
| ▲ | fragmede 43 minutes ago | parent | prev | next [-] | | I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded. | |
| ▲ | zamadatix 2 hours ago | parent | prev | next [-] | | I appreciate you taking the time to respond to these kinds of questions the last few days. | |
| ▲ | Trufa 4 hours ago | parent | prev | next [-] | | Can you be more specific than this? Does it vary over time, from the launch of a model through the next few months, beyond tinkering and optimization? | | |
| ▲ | tedsanders 3 hours ago | parent | next [-] | | Yeah, happy to be more specific. No intention of making any technically true but misleading statements. The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release).

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but, to be honest, not every small change gets logged there. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change, though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/ | | |
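The "non-determinism in batched non-associative math" caveat is easy to demonstrate: floating-point addition is not associative, so summing the same values in a different order (as different batch sizes or hardware can do) yields slightly different results. A minimal illustration:

```python
# Floating-point addition is not associative: the grouping order changes
# the result when a small value meets a large one.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the large terms cancel first, then 1.0 is added
right = a + (b + c)  # 1.0 is absorbed into -1e16 before the cancellation

print(left, right)   # 1.0 0.0
```

The same effect at the scale of a matrix multiply, accumulated over billions of operations, is why identical prompts can produce slightly different logits from batch to batch.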
| ▲ | Trufa 31 minutes ago | parent | next [-] | | I'll ask then, unironically: am I imagining that models are great when they launch and degrade over time? I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seems so in-your-face. I don't discount novelty plus getting used to it, plus psychological factors. Do you have any takes on this? | |
| ▲ | jychang 3 hours ago | parent | prev | next [-] | | What about the juice variable? https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low... | | |
| ▲ | tedsanders 2 hours ago | parent | next [-] | | Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-... The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here. If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out). | |
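For what it's worth, in the API (as opposed to ChatGPT's product defaults) thinking time is pinned explicitly per request via the reasoning-effort setting, so product-side tweaks like this don't apply there. A sketch of the request payload; the field layout follows the Responses API as I understand it, and the model name is illustrative, so verify against the current docs:

```python
# Illustrative request payload; "gpt-5.2" and the exact field names are
# assumptions to be checked against OpenAI's current API reference.
request = {
    "model": "gpt-5.2",
    "reasoning": {"effort": "high"},  # e.g. "low" | "medium" | "high"
    "input": "Summarize the trade-offs of longer thinking times.",
}
```

Because the effort level is part of the request, API behavior stays fixed even when the ChatGPT UI changes its defaults.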
| ▲ | tgrowazay 2 hours ago | parent | prev [-] | | Isn’t that just a cap on how many reasoning steps the model should take at most? |
| |
| ▲ | ComplexSystems 3 hours ago | parent | prev [-] | | Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost? | | |
| |
| ▲ | joshvm 3 hours ago | parent | prev [-] | | My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses. It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks. If you make raw API calls and see behavioural changes over time, that would be another concern. |
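The raw-API-call check the comment suggests can be sketched as a tiny drift monitor: send the same pinned-model, temperature-0 prompt on a schedule and compare each response against a stored baseline. The network call is stubbed out here, and `fingerprint` / `check_drift` are hypothetical names of my own:

```python
import hashlib

def fingerprint(response_text):
    """Short stable hash of a response, for cheap comparison over time."""
    return hashlib.sha256(response_text.encode()).hexdigest()[:16]

def check_drift(baseline_fp, response_text):
    """Compare today's response against the recorded baseline fingerprint."""
    fp = fingerprint(response_text)
    return {"fingerprint": fp, "matches_baseline": fp == baseline_fp}

# Record a baseline once, then re-run the identical prompt later and compare.
baseline = fingerprint("The capital of France is Paris.")
result = check_drift(baseline, "The capital of France is Paris.")
```

Exact string matching is brittle even at temperature 0 (per the non-determinism caveat upthread), so in practice you'd score a small eval set or compare semantic content rather than bytes; but a fingerprint log is enough to show whether raw API behavior is shifting independently of the harness.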
| |
| ▲ | Someone1234 4 hours ago | parent | prev [-] | | Specifically including routing (i.e. which model you route to based on load/ToD)? PS - I appreciate you coming here and commenting! | | |
| ▲ | hhh 4 hours ago | parent [-] | | There is no routing with the API, or when you choose a specific model in ChatGPT. | |
|
| |
| ▲ | Corence 5 hours ago | parent | prev | next [-] | | It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks, and use that information to figure out how to improve their own models. If the benchmark numbers aren't real, their competitors will call out that they're not reproducible. However, it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do. | | |
| ▲ | mrandish 2 hours ago | parent [-] | | > I'd expect the numbers are all real. I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...). It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (e.g. harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal. And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating that those two things remain in sync, they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late-night incident alert. This isn't even a dig at OpenAI in particular; it's just the default state of how large orgs work. |
| |
| ▲ | ifwinterco 4 hours ago | parent | prev | next [-] | | On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better | | |
| ▲ | CraigJPerry 3 hours ago | parent | next [-] | | There's an extended thinking mode for GPT 5.2; I forget the name of it right at this minute. It's super slow: a 3 minute Opus 4.5 prompt takes circa 12 minutes to complete in 5.2 on that super extended thinking mode. But it is not a close race in terms of results; GPT 5.2 wins by a handy margin in that mode. It's just too slow to be usable interactively though. | |
| ▲ | ifwinterco 2 hours ago | parent [-] | | Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion |
| |
| ▲ | elAhmo 4 hours ago | parent | prev | next [-] | | I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better. Curious to see how things will be with 5.3 and 4.6 | |
| ▲ | georgeven 4 hours ago | parent | prev | next [-] | | Interesting. Everyone in my circle said the opposite. | | |
| ▲ | MadnessASAP 41 minutes ago | parent | next [-] | | My experience is that Codex follows directions better but Claude writes better code. ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task, and keeps it updated almost to a fault. Claude-Opus-4.5, with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple of functions; it was documented in a few places, including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as documented. Claude decided it was easier to do the exact opposite: it rewrote the function, the comments, and the documentation to say it now did the opposite of what was previously there. If I believed an LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said: // Invariant: regardless of the value of X, this function cannot return Y
And it turned it into: // Returns Y if X is true
| |
| ▲ | krzyk 3 hours ago | parent | prev [-] | | It probably depends on programming language and expectations. | | |
| ▲ | ifwinterco 2 hours ago | parent [-] | | This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming", but it pays the bills. They can both write fairly good idiomatic code, but in my experience Opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly the first time more often than Codex. I still don't trust it, obviously, but out of all LLMs it's the closest to actually starting to earn my trust. |
|
| |
| ▲ | SatvikBeri 2 hours ago | parent | prev [-] | | I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development. |
| |
| ▲ | smcleod 3 hours ago | parent | prev | next [-] | | I don't think much from OpenAI can be trusted tbh. | |
| ▲ | aaaalone 5 hours ago | parent | prev | next [-] | | At the end of the day you test it for your use cases anyway, but it's a great initial hint as to whether it's worth testing out. | |
| ▲ | cyanydeez 5 hours ago | parent | prev | next [-] | | When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM? I definitely suspect all these models are being degraded during heavy loads. | | |
| ▲ | j_maffe 4 hours ago | parent [-] | | This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance. |
| |
| ▲ | thinkingtoilet 3 hours ago | parent | prev [-] | | We know OpenAI already got caught obtaining benchmark data and tuning their models to it. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt. | | |
| ▲ | rvz 2 hours ago | parent [-] | | Meta researchers did the same thing with Llama 4, which shows what can go wrong when 'independent' researchers begin to game AI benchmarks. [0] You always have to question these benchmarks, especially when in-house researchers can potentially game them if they want to. Which is why evaluation must be independent. [0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...
|
|