MrOrelliOReilly 2 days ago

I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.

See: https://artificialanalysis.ai

woadwarrior01 2 days ago | parent | next [-]

> Opus 4.5 is absolutely a state of the art model.

> See: https://artificialanalysis.ai

The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

MrOrelliOReilly 2 days ago | parent | next [-]

Totally, but OP's point was that Claude had to compensate for deficiencies versus a state-of-the-art model like ChatGPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" here to narrowly mean #1 on a given benchmark, but rather near or at the frontier of current capabilities.

gessha 2 days ago | parent | prev | next [-]

One thing to remember when comparing ML models of any kind is that single value metrics obscure a lot of nuance and you really have to go through the model results one by one to see how it performs. This is true for vision, NLP, and other modalities.
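
A toy illustration of that (all numbers are invented for the example, not taken from any real benchmark): two models with identical aggregate accuracy can look completely different once you slice the results by category.

    from collections import defaultdict

    results = [
        # (model, category, correct) -- made-up outcomes for illustration only
        ("model_a", "web_dev", True),   ("model_a", "web_dev", True),
        ("model_a", "embedded", False), ("model_a", "embedded", False),
        ("model_b", "web_dev", True),   ("model_b", "web_dev", False),
        ("model_b", "embedded", True),  ("model_b", "embedded", False),
    ]

    overall = defaultdict(lambda: [0, 0])   # model             -> [correct, total]
    by_cat  = defaultdict(lambda: [0, 0])   # (model, category) -> [correct, total]
    for model, category, correct in results:
        overall[model][0] += int(correct)
        overall[model][1] += 1
        by_cat[(model, category)][0] += int(correct)
        by_cat[(model, category)][1] += 1

    for key, (correct, total) in list(overall.items()) + list(by_cat.items()):
        print(key, f"{correct}/{total} = {correct / total:.0%}")

    # Both models score 50% overall, but model_a is 100% on web_dev and 0% on
    # embedded while model_b is 50% on each -- the single number hides that.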

dr_dshiv 2 days ago | parent | prev | next [-]

https://lmarena.ai/leaderboard/webdev

LM Arena shows Claude Opus 4.5 on top

HarHarVeryFunny 2 days ago | parent [-]

I wonder how model competence and/or user preference on web development (that leaderboard) carries over to larger, more complex projects, or more generally to anything other than web development?

In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kind of coding tasks these models are being RL-trained for? Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development etc in different languages?

ramoz 2 days ago | parent | prev | next [-]

https://x.com/giansegato/status/2002203155262812529/photo/1

https://x.com/METR_Evals/status/2002203627377574113

> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

What an insane take for anybody who uses these models daily.

MrOrelliOReilly 2 days ago | parent [-]

Yes, I personally feel that the "official" benchmarks are increasingly diverging from the everyday reality of using these models. My theory is that we are reaching a point where all the models are intelligent enough for day-to-day queries, so factors like style/personality and proper use of web queries and other capabilities are better differentiators than intelligence alone.

int_19h 10 hours ago | parent [-]

The benchmarks haven't reflected the real utility for a very long time. At best they tell you which models are definitely bad.

fzzzy 2 days ago | parent | prev [-]

Is x-high fast enough to use as a coding agent?

wahnfrieden 2 days ago | parent [-]

Yes, if you parallelize your work, which you must learn to do if you want the best quality
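
Concretely, "parallelize" can be as simple as fanning independent tasks out to separate agent runs and collecting the results. Rough sketch below; the "agent" command and its flags are placeholders for whatever CLI or SDK you actually use, not any real tool's interface.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # The point is just that slow per-task latency matters less when several
    # independent tasks run side by side.
    tasks = [
        ("fix-flaky-test", "Investigate and fix the flaky login test"),
        ("update-deps", "Bump minor dependency versions and run the test suite"),
        ("add-logging", "Add structured logging to the payment service"),
    ]

    def run_task(name, prompt):
        # Each task gets its own working copy so the runs can't step on each other.
        proc = subprocess.run(
            ["agent", "run", "--prompt", prompt, "--workdir", f"worktrees/{name}"],
            capture_output=True, text=True,
        )
        return name, proc.returncode

    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        for name, code in pool.map(lambda t: run_task(*t), tasks):
            print(name, "ok" if code == 0 else f"failed ({code})")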

wahnfrieden 2 days ago | parent | prev [-]

What am I missing? As suspicious as benchmarks are, your link shows GPT 5.2 to be superior.

It is also out of date as it does not include 5.2 Codex.

Per my point about steerability being compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence - that's 17 percentage points, or roughly 30% in relative terms (75 / 58 ≈ 1.29). No wonder Claude Code needs those harness features so the user can manually rein in its instruction following.