babelfish 3 hours ago

Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

  SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
  SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
  SWE-bench Multilingual:    87.3% / 77.8% / —     / —
  SWE-bench Multimodal:      59.0% / 27.1% / —     / —
  Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%

  GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
  MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
  USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
  GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —

  HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
  HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%

  CharXiv (no tools):        86.1% / 61.5% / —     / —
  CharXiv (with tools):      93.2% / 78.9% / —     / —

  OSWorld:                   79.6% / 72.7% / 75.0% / —
sourcecodeplz 3 hours ago | parent | next [-]

Haven't seen a jump this large in, I don't even know, years? Too bad they're not releasing it anytime soon (there's no need, since they're still the current leader).

ru552 3 hours ago | parent | next [-]

There's speculation that next Tuesday will be a big day for OpenAI and possibly GPT 6. Anthropic showed their hand today.

varispeed 2 minutes ago | parent | next [-]

Sounds like a good opportunity to pause spending on nerfed 4.6 and wait for the new model to be released and then max out over 2 weeks before it gets nerfed again.

enraged_camel 3 hours ago | parent | prev | next [-]

That does not sound very believable. Last time Anthropic released a flagship model, it was followed by GPT Codex literally that afternoon.

cyanydeez 2 hours ago | parent [-]

Y'all know they're teaching to the test. I'll wait till someone devises a novel test that isn't already contained in the training datasets. Sure, they're still powerful.

swalsh an hour ago | parent | prev [-]

My understanding is GPT 6 works via synaptic space reasoning... which I find terrifying. I hope if true, OpenAI does some safety testing on that, beyond what they normally do.

tyre 16 minutes ago | parent | next [-]

From the recent New Yorker piece on Sam:

“My vibes don’t match a lot of the traditional A.I.-safety stuff,” Altman said. He insisted that he continued to prioritize these matters, but when pressed for specifics he was vague: “We still will run safety projects, or at least safety-adjacent projects.” When we asked to interview researchers at the company who were working on existential safety—the kinds of issues that could mean, as Altman once put it, “lights-out for all of us”—an OpenAI representative seemed confused. “What do you mean by ‘existential safety’?” he replied. “That’s not, like, a thing.”

levocardia 36 minutes ago | parent | prev | next [-]

Oh you mean literally the thing in AI2027 that gets everyone killed? Wonderful.

notrealyme123 38 minutes ago | parent | prev | next [-]

That sounds really interesting. Do you have some hints on where to read more?

arm32 34 minutes ago | parent | prev [-]

Oh, of course they will /s

Jcampuzano2 3 hours ago | parent | prev [-]

A jump that we will never be able to use, since we're not part of the seemingly-minimum-$100-billion company club required to be allowed to use it.

I get the security aspect, but if we've hit that point, any reasonably sophisticated model from here on will be able to do the damage they claim it can do. They might as well be telling us they're closing up shop for consumer models.

They should just say they'll never release a model of this caliber to the public at this point, and say out loud that we'll only get gimped versions.

cedws 2 hours ago | parent | next [-]

More than killer AI, I'm afraid of Anthropic/OpenAI going into full rent-seeking mode, so that everyone working in tech is forced to fork out loads of money just to stay competitive in the market. These companies could also choose to give exclusive access to hand-picked individuals and cut everyone else off, and there would be nothing to stop them.

This is already happening to some degree, GPT 5.3 Codex's security capabilities were given exclusively to those who were approved for a "Trusted Access" programme.

TypesWillSaveUs 2 hours ago | parent | next [-]

Describing providing a highly valuable service for money as "rent seeking" is pretty wild.

bertil an hour ago | parent | next [-]

It could be, formally, if they have a monopoly.

However, I’m tempted to compare to GitHub: if I join a new company, I will ask to be included in their GitHub account without hesitation. I couldn’t possibly imagine they wouldn’t have one. What makes the cost of that subscription reasonable is not just GitHub’s fear of a crowd with pitchforks showing up at their office, but also the fact that a possible answer to my non-question might be “Oh, we actually use GitLab.”

If Anthropic is as good as they say, it seems fairly doable to use the service to build something comparable: poach a few disgruntled employees, and leverage the promise of undercutting a many-trillion-dollar company to become a many-billion-dollar company to get investors excited.

I’m sure the founders of Anthropic will have more money than they could possibly spend in ten lifetimes, but I can’t imagine there wouldn’t be some competition. Maybe this time it’s different, but I can’t see how.

johnsimer 32 minutes ago | parent [-]

> It could be, formally, if they have a monopoly.

You have 2 labs at the forefront (Anthropic/OpenAI), Google closely behind, and xAI/Meta/half a dozen Chinese companies all within 6-12 months. There is plenty of competition, and the price of equally intelligent tokens drops rapidly whenever a new intelligence level is achieved.

Unless the leading company uses a model to nefariously take over or neutralize another company, I don't really see a monopoly happening in the next 3 years.

1attice 2 hours ago | parent | prev [-]

My housing is pretty valuable. I pay rent. Which timeline are you in?

kaashif an hour ago | parent | next [-]

Rent seeking refers to https://en.wikipedia.org/wiki/Rent-seeking

bonsai_spool an hour ago | parent | prev | next [-]

Actually you're saying similar things:

Rent-seeking of old was a ground rent, monies paid for the land without considering the building that was on it.

Residential rents today often come with implied warranties under modern law, so your landlord is essentially selling you a service at a particular location.

mhluongo 35 minutes ago | parent | prev [-]

Two different "rent"s.

aspenmartin 2 hours ago | parent | prev | next [-]

Well, don’t forget we still have competition. Were Anthropic to rent-seek, OpenAI would undercut them. Were OpenAI and Anthropic to collude, that would be illegal. And as for Anthropic capturing the entire coding-agent market and THEN rent-seeking: these days it’s never been easier to raise $1B and start a competing lab.

cedws 2 hours ago | parent [-]

In practice this doesn't work, though; the Mastercard-Visa duopoly is an example. Two competing forces don't create aggressive enough competition to benefit the consumer. The only hope we have is the Chinese models, but it will always be too expensive to run the full models yourself.

brokencode 2 hours ago | parent | next [-]

New companies can enter this space. Google’s competing, though behind. Maybe Microsoft, Meta, Amazon, or Apple will come out with top notch models at some point.

There is no real barrier to a customer of Anthropic adopting a competing model in the future. All it takes is a big tech company deciding it’s worth it to train one.

On the other hand, Visa/Mastercard have a lot of lock-in due to consumers only wanting to get a card that’s accepted everywhere, and merchants not bothering to support a new type of card that no consumer has. There’s a major chicken and egg problem to overcome there.

sghiassy 2 hours ago | parent | prev [-]

Chinese competition can always be banned. Example: Chinese electric car competition

sho_hn 2 hours ago | parent | next [-]

That's what OP was saying, I think, noting that running them locally won't be a solution.

oblio an hour ago | parent | prev [-]

Also Chinese smartphones. Huawei was about 12-18 months from becoming the biggest smartphone manufacturer in the world a few years ago. If it had been allowed to sell its phones freely in the US, I'm fairly sure Apple would have been closer to Nokia than to current-day Apple.

aurareturn 42 minutes ago | parent [-]

If Huawei had never been banned from using TSMC, they'd likely have a real Nvidia competitor and may have surpassed Apple in mobile chip design.

They actually beat Apple's A series, becoming the first phone maker to use TSMC's N7 node.

therealdeal2020 an hour ago | parent | prev | next [-]

But you are assuming that the magical wizards are the only ones who can create powerful AIs... mind you, these people were born just a few decades ago. Their knowledge will be transferred, and it will only take a few more decades until anyone can train powerful AIs... you can only sit on tech for so long before everyone knows how to do it.

cedws 37 minutes ago | parent [-]

It's not a matter of knowledge, it's a matter of resources. It takes billions of dollars of hardware to train a SOTA LLM and it's increasing all the time. You cannot possibly hope to compete as an independent or small startup.

block_dagger a few seconds ago | parent [-]

Presumably, the hardware to run this level of model will be democratized within the timeframe of the parent comment.

MattRix an hour ago | parent | prev [-]

The thing is that the current models can ALREADY replicate most software-based products and services on the market. The open source models are not far behind. At a certain point I'm not sure it matters if the frontier models can do it faster and better. I see how they're useful for really complex and cutting-edge use cases, but that's not what most people are using them for.

guzfip 2 hours ago | parent | prev | next [-]

> A jump that we will never be able to use since we're not part of the seemingly minimum 100 billion dollar company club as requirement to be allowed to use it.

> They should just say they'll never release a model of this caliber to the public at this point and say out loud we'll only get gimped

Duh, this was fucking obvious from the start. The only people saying otherwise were zealots who needed a quick line to dismiss legitimate concerns.

quotemstr 2 hours ago | parent | prev [-]

This is why the EAs, and their almost comic-book-villain projects like "control AI dot com" cannot be allowed to win. One private company gatekeeping access to revolutionary technology is riskier than any consequence of the technology itself.

scrawl an hour ago | parent | next [-]

Having done a quick search of "control AI dot com", it seems their intent is to educate lawmakers and government in order to aid development of a strong regulatory framework around frontier AI development.

Not sure how this is consistent with "One private company gatekeeping access to revolutionary technology"?

quotemstr 42 minutes ago | parent [-]

> strong regulatory framework around frontier AI development

You have to decode feel-good words into the concrete policy. The EAs believe that the state should prohibit entities not aligned with their philosophy from developing AIs beyond a certain power level.

frozenseven 2 hours ago | parent | prev | next [-]

Couldn't agree more. The "safest" AI company is actually the biggest liability. I hope other companies make a move soon.

FeepingCreature 2 hours ago | parent | prev [-]

No it isn't lol. The consequence of the technology literally includes human extinction. I prefer 0 companies, but I'll take 1 over 5.

WarmWash 2 hours ago | parent | prev | next [-]

Are these fair comparisons? It seems like Mythos is going to be a 5.4 Ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.

mulmboy 2 hours ago | parent [-]

There are a few hints in the doc around this

> Importantly, we find that when used in an interactive, synchronous, “hands-on-keyboard” pattern, the benefits of the model were less clear. When used in this fashion, some users perceived Mythos Preview as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model’s coding capabilities. (p201)

^^ From the surrounding context, this could just be because the model tends to do a lot of work in the background which naturally takes time.

> Terminal-Bench 2.0 timeouts get quite restrictive at times, especially with thinking models, which risks hiding real capabilities jumps behind seemingly uncorrelated confounders like sampling speed. Moreover, some Terminal-Bench 2.0 tasks have ambiguities and limited resource specs that don’t properly allow agents to explore the full solution space — both being currently addressed by the maintainers in the 2.1 update. To exclusively measure agentic coding capabilities net of the confounders, we also ran Terminal-Bench with the latest 2.1 fixes available on GitHub, while increasing the timeout limits to 4 hours (roughly four times the 2.0 baseline). This brought the mean reward to 92.1%. (p188)

> ...Mythos Preview represents only a modest accuracy improvement over our best Claude Opus 4.6 score (86.9% vs. 83.7%). However, the model achieves this score with a considerably smaller token footprint: the best Mythos Preview result uses 4.9× fewer tokens per task than Opus 4.6 (226k vs. 1.11M tokens per task). (p191)
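The quoted footprint ratio on p191 checks out; a quick sanity check using only the numbers from the quote above:

```python
# Token footprint per task, as quoted from the system card (p191)
opus_tokens = 1_110_000   # Claude Opus 4.6: 1.11M tokens per task
mythos_tokens = 226_000   # Mythos Preview: 226k tokens per task

ratio = opus_tokens / mythos_tokens
print(round(ratio, 1))  # 4.9, matching the reported "4.9x fewer tokens"
```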

alyxya an hour ago | parent [-]

The first point is along the lines of what I'd expect, given that Claude Code is generally reliable at this point. A model's raw intelligence doesn't seem as important right now compared to being able to support arbitrary-length context.

ninjagoo 35 minutes ago | parent | prev | next [-]

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?

And the decision to withhold general release (of a 'preview', no less!) seems, well, odd. And the decision to release a 'preview' version to specific companies? You know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.

What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?

TacticalCoder 25 minutes ago | parent [-]

> Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen

We're not reading the same numbers, I think. Compared to Opus 4.6, it's a big jump in nearly every single bench GP posted. They're "only" catching up to Google's Gemini on GPQA and MMMLU, but they're still beating their own Opus 4.6 results on these two.

This sounds like a much better model than Opus 4.6.

ninjagoo 12 minutes ago | parent [-]

> We're not reading the same numbers I think.

We must not be.

That's why I listed out the ones where it is barely competitive from @babelfish's table, which itself is extracted from Pg 186 & 187 of the System Card, which has the comparison with Opus 4.6, GPT 5.4 and Gemini 3.1 Pro.

Sure, it may be better than Opus 4.6 on some of those, but it barely achieves a small increase over GPT-5.4 on the ones I called out.

pants2 3 hours ago | parent | prev | next [-]

We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%

Leynos 2 hours ago | parent | next [-]

Opus 4.6 currently leads the remote labor index at 4.17. GPT-5.4 isn't measured on that one though: https://www.remotelabor.ai/

GPT 5.4 Pro leads Frontier Maths Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/

randomtoast 2 hours ago | parent | prev [-]

[dead]

AlexC04 an hour ago | parent | prev | next [-]

but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!

(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)

bertil an hour ago | parent [-]

We are all fans of Simon’s work, and his test is, strangely enough, quite good.

whalesalad 3 hours ago | parent | prev | next [-]

Honestly, we are all sleeping on GPT-5.4. Particularly with the recent influx of Claude users (and an increasingly unstable platform), Codex has been added to my rotation and it's surprising me.

babelfish 3 hours ago | parent | next [-]

Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)

girvo an hour ago | parent [-]

It really isn’t. I wish it were, because work complains about overuse of Opus.

rafaelmn 3 hours ago | parent | prev [-]

GPT is shit at writing code. It's not dumb - extra high thinking is really good at catching stuff - but it's like letting a smart junior into your codebase: it ignores all the conventions and surrounding context, and slops all over the place just to get it working. Claude is just a level above in terms of editing code.

sho_hn 2 hours ago | parent | next [-]

Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.

Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.

It's annoying, too, because I don't much like OpenAI as a company.

(Background: 25 years of C++ etc.)

boring-human 37 minutes ago | parent [-]

Same background as you, and same exact experience as you. Opus and Gemini have not come close to Codex for C++ work. I also run exclusively on xhigh. Its handling of complexity is unmatched.

At least until next week when Mythos and GPT 6 throw it all up in the air again.

Jcampuzano2 3 hours ago | parent | prev | next [-]

Not my experience. GPT 5.4 walks all over Claude from what I've worked with, and it's Claude that is willing to just go do unnecessary stuff that was never asked for, or implement the more hacky solutions without a care for maintainability/readability.

But I do not use extra high thinking unless it's for code review. I sit at GPT 5.4 high 95% of the time.

zarzavat 3 hours ago | parent | prev | next [-]

Yes, it's becoming clear that OpenAI kinda sucks at alignment. GPT-5 can pass all the benchmarks but it just doesn't "feel good" like Claude or Gemini.

lilytweed 2 hours ago | parent | next [-]

Whenever I come back to ChatGPT after using Claude or Gemini for an extended period, I’m really struck by the “AI-ness.” All the verbal tics and, truly, the sloppiness have been trained away by the other, more human-feeling models at this point.

chaos_emergent 2 hours ago | parent | prev [-]

An alternative but similar formulation of that statement is that Anthropic has spent more training effort in getting the model to “feel good” rather than being correct on verifiable tasks. Which more or less tracks with my experience of using the model.

leobuskin 3 hours ago | parent | prev | next [-]

And as a bonus: GPT is slow. I’m doing a lot of RE (IDA Pro + MCP), and even when 5.4 gives slightly better guesses (rarely, but it happens), it takes 2-4x longer. So it’s just easier to iterate with Opus.

blazespin an hour ago | parent [-]

Yeah, need some good RE benchmarks for the LLMs. :)

RE is a very interesting problem. A lot more than SWE can be RE'd. I've found the LLMs are reluctant to assist, though you can work around it.

porker 44 minutes ago | parent [-]

What is RE in this context?

astrange 30 minutes ago | parent [-]

Reverse engineering

whalesalad 3 hours ago | parent | prev [-]

This has been my experience. With very, very rigid constraints it does OK, but without them it will optimize for expediency and getting it done, at the expense of integrating with the broader system.

ctoth 2 hours ago | parent [-]

My favorite example of this from last night:

Me: Let's figure out how to clone our company Wordpress theme in Hugo. Here're some tools you can use, here's a way to compare screenshots, iterate until 0% difference.

Codex: Okay Boss! I did the thing! I couldn't get the CSS to match so I just took PNGs of the original site and put them in place! Matches 100%!
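For what it's worth, the "iterate until 0% difference" loop in that prompt only needs a trivial pixel-mismatch metric. A minimal sketch (`percent_diff` is a hypothetical helper, not the actual tooling from the comment):

```python
def percent_diff(pixels_a, pixels_b):
    """Percentage of pixels that differ between two same-size screenshots,
    each given as a flat sequence of (R, G, B) tuples."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("screenshots must be the same size")
    mismatched = sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)
    return 100.0 * mismatched / len(pixels_a)

# One pixel out of four differs -> 25.0% difference
baseline = [(0, 0, 0)] * 4
candidate = [(0, 0, 0)] * 3 + [(255, 255, 255)]
print(percent_diff(baseline, candidate))  # 25.0
```

The punchline of the anecdote, of course, is that swapping in PNGs of the original site drives a metric like this straight to 0% without cloning anything.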

simianwords 3 hours ago | parent | prev [-]

The real one is SWE-bench Verified, since there is no way to overfit. That's the only one we can believe.

ollin 2 hours ago | parent [-]

My impression was entirely the opposite; the unsolved subset of SWE-bench Verified problems is memorizable (solutions are pulled from public GitHub repos), and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.

simianwords 2 hours ago | parent [-]

I stand corrected.