> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers I think. Compared to Opus 4.6, it's a big jump in nearly every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU, but they're still beating their own Opus 4.6 results on these two.

This sounds like a much better model than Opus 4.6.

▲

ninjagoo 2 hours ago | parent [-]

> We're not reading the same numbers I think.

We must not be.

That's why I listed the ones where it is barely competitive, taken from @babelfish's table, which is itself extracted from pp. 186-187 of the System Card and compares against Opus 4.6, GPT-5.4 and Gemini 3.1 Pro.

Sure, it may be better than Opus 4.6 on some of those, but it achieves only a small increase over GPT-5.4 on the ones I called out.

▲

nimchimpsky an hour ago | parent [-]

Barely competitive? The Mythos column is the first column.

You are the only person with this take on Hacker News; everyone else is saying "this is a massive jump". FWIW, the data you list shows the biggest jump I remember for Mythos.

▲

devmor 23 minutes ago | parent [-]

The biggest jump in the numbers they quoted is 6%.

Please look at the columns OTHER than Opus as well.

▲

josephg 9 minutes ago | parent [-]

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> The biggest jump in the numbers they quoted is 6%.

Just in the numbers you quoted, that's a 16.6 percentage point jump in Terminal-Bench and a 55.3 percentage point absolute increase in USAMO over their previous Opus 4.6 model.
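For anyone who wants to check the arithmetic, here is a quick sketch that takes only the two rows quoted in this comment and computes the Mythos lead over each other model in percentage points. The names `MODELS`, `SCORES`, `point_gaps` and `GAPS` are made up for illustration; nothing here comes from the System Card beyond the quoted figures.

```python
# Scores (percent) copied from the two rows quoted in this thread.
# Column order: Claude Mythos, Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro.
MODELS = ["Claude Mythos", "Claude Opus 4.6", "GPT-5.4", "Gemini 3.1 Pro"]
SCORES = {
    "Terminal-Bench 2.0": [82.0, 65.4, 75.1, 68.5],
    "USAMO": [97.6, 42.3, 95.2, 74.4],
}


def point_gaps(scores):
    """Return the first column's lead over each other column, in percentage points."""
    mythos, *others = scores
    # Round to one decimal to hide floating-point noise such as 16.599999...
    return {name: round(mythos - s, 1) for name, s in zip(MODELS[1:], others)}


# Hypothetical helper table: benchmark -> {other model: Mythos lead in points}.
GAPS = {bench: point_gaps(vals) for bench, vals in SCORES.items()}

for bench, gaps in GAPS.items():
    print(bench, gaps)
```

On these two rows the lead over Opus 4.6 is 16.6 and 55.3 points, while the lead over GPT-5.4 is 6.9 and 2.4 points, so which "jump" you see depends on which column you compare against.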