woadwarrior01 2 days ago

> Opus 4.5 is absolutely a state of the art model. See: https://artificialanalysis.ai

The field moves fast. Per Artificial Analysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
MrOrelliOReilly 2 days ago

Totally, but OP's point was that Claude had to compensate for deficiencies versus a state of the art model like ChatGPT 5.2, and I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" here to narrowly mean #1 on a given benchmark, but rather to mean near or at the frontier of current capabilities.
gessha 2 days ago

One thing to remember when comparing ML models of any kind is that single-value metrics obscure a lot of nuance; you really have to go through the model results one by one to see how each model performs. This is true for vision, NLP, and other modalities.
dr_dshiv 2 days ago

LM Arena shows Claude Opus 4.5 on top: https://lmarena.ai/leaderboard/webdev
ramoz 2 days ago

https://x.com/giansegato/status/2002203155262812529/photo/1
https://x.com/METR_Evals/status/2002203627377574113

> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.

What an insane take for anybody who uses these models daily.
fzzzy 2 days ago

Is x-high fast enough to use as a coding agent?