mohsen1 4 hours ago

Since they are not showing how this model compares against its competitors on the benchmarks they cite, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:

    SWE-Bench (Pro / Verified)

    Model               | Pro (%) | Verified (%)
    --------------------+---------+--------------
    GPT-5.2-Codex       | 56.4    | ~80
    GPT-5.2             | 55.6    | ~80
    Claude Opus 4.5     | n/a     | ~80.9
    Gemini 3 Pro        | n/a     | ~76.2

And for terminal workflows, where agentic steps matter:

    Terminal-Bench 2.0

    Model               | Score (%)
    --------------------+-----------
    Claude Opus 4.5     | ~60+
    Gemini 3 Pro        | ~54
    GPT-5.2-Codex       | ~47

So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:

- Claude is still ahead on strict coding + terminal-style tasks

- Gemini is better for huge context + multimodal reasoning

- GPT-5.2-Codex is strong but not clearly the new state of the art across the board

It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.

thedougd 8 minutes ago

I'm finding that the newer GPT models are much more willing to leverage tools/skills than Claude, which reduces the number of interventions where I have to grant approval. Just an observation.

qwesr123 4 hours ago

Where are you getting SWE-Bench Verified scores for 5.2-Codex? AFAIK those have not been published.

And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks, Opus 4.5 is at 59% and GPT-5.2-Codex is at 64%.

See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/

scellus 3 hours ago

I like Opus 4.5 a lot, but a general comment on benchmarks: the number of subtasks or problems in each one is finite, and many of the benchmarks are saturating, so the effective number of problems at the frontier is even smaller. If you think of the generalizable capability of the model as a latent feature to be measured by benchmarks, we therefore have only rather noisy estimates, and people read too much into small differences in the numbers. It's best to aggregate across many benchmarks: Epoch has their Capabilities Index, Artificial Analysis is doing something similar, and there are probably others I don't know or remember.
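
To put a rough number on that noise, here is a minimal sketch (my own illustration, not from any benchmark authors) that treats each task as an independent pass/fail trial and uses SWE-Bench Verified's 500-problem size:

    import math

    def binomial_stderr(accuracy_pct: float, n_tasks: int) -> float:
        """Standard error of a pass rate measured on n_tasks independent problems."""
        p = accuracy_pct / 100.0
        return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)

    # SWE-Bench Verified has 500 problems; at ~80% accuracy the sampling noise alone is:
    se = binomial_stderr(80.0, 500)
    print(f"one standard error: +/- {se:.1f} points")  # roughly +/- 1.8

    # So a gap like 80.9 vs 80.0 sits well inside one standard error,
    # before accounting for scaffolding, prompting, or run-to-run variance.

The i.i.d. binomial model if anything understates the real variance, which is exactly why sub-point differences at the frontier don't mean much on their own.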

And then there's the part of models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, though I haven't tried gpt-5.2 for coding yet. (Nor Gemini 3 Pro; I'm not claiming superiority of Opus, just that something in practical usability is hard to measure.)

blitz_skull 2 hours ago

Ahhh, there it is.

My rule of thumb with OpenAI is: if they don’t publish their benchmarks beside Anthropic’s numbers, it’s because they’re still not caught up.

So far my rule of thumb has held true.