HarHarVeryFunny 4 hours ago

I'm skeptical about how much of a coding advance it will be.

Seems odd that their announcement has zero coding benchmarks, with the closest related thing being terminal bench.

hereme888 3 hours ago | parent | next [-]

Tracking model performance on Artificial Analysis makes me think these models are constantly optimized/tuned in some way or another. GPT 5.5 was scoring in the mid-60s when it was first released; now it's almost 10 points higher.

jdw64 4 hours ago | parent | prev | next [-]

Maybe I'll know once I try it? Honestly, for small functions or methods, I don't think there's a huge difference between models. But the larger the code gets, the more noticeable the difference seems to be.

Personally, I think this kind of coding experience varies from person to person.

vanuatu 4 hours ago | parent | prev | next [-]

sadly with all the labs benchmaxxing I feel like you just have to try the model for a while to really evaluate how good it is, especially for each individual use case

MangoCoffee 2 hours ago | parent | prev | next [-]

>zero coding benchmarks

"What gets measured gets managed"

artursapek 4 hours ago | parent | prev [-]

They claim extreme performance on ExploitBench, which Mythos was touted as being incredible at. https://x.com/OpenAI/status/2070555278576439306

HarHarVeryFunny 2 hours ago | parent | next [-]

My guess is that it's the same base model as 5.5, but with additional post-training to improve and benchmaxx on a few things like that.

If they really thought it was competitive with Mythos/Fable across the board, then why wouldn't they release a broader set of benchmarks, and why price it day 1 at 1/2 the cost of Fable?

andriy_koval 3 hours ago | parent | prev [-]

On the graph, they are still slightly below Mythos. Maybe just enough to not be prohibited by the US government?