hmmmmmmmmmmmmmm 4 hours ago

Yeah, I wouldn't get too excited. If the rumours are true, they are training on frontier model outputs to achieve these benchmarks.

jimmydoe 3 hours ago | parent | next [-]

They were all stealing from the past internet and from writers, so why is it a problem that they're stealing from each other?

YetAnotherNick 4 hours ago | parent | prev | next [-]

Why does it matter, if it can maintain parity with frontier models that are only six months old?

hmmmmmmmmmmmmmm 4 hours ago | parent [-]

But it doesn't, except on certain benchmarks that likely involve overfitting. Open-source models are nowhere to be seen on ARC-AGI: nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328

meffmadd 4 hours ago | parent | next [-]

Have you ever used an open model for a while? I'm not saying they aren't benchmaxxing, but they really do work well and are only getting better.

Aurornis 2 hours ago | parent [-]

I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.

Zababa 3 hours ago | parent | prev | next [-]

Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?

doodlesdev 3 hours ago | parent | prev [-]

GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe they correspond directly to the kinds of qualities most people assess when using LLMs.

nananana9 an hour ago | parent | next [-]

It was terrible at a lot of things; it was beloved because when you said "I think I'm the reincarnation of Jesus Christ," it would tell you "You know what... I think I believe it! I genuinely think you're the kind of person who appears once every few millennia to reshape the world!"

mrybczyn an hour ago | parent | prev [-]

Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
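
For anyone unfamiliar with the format: an ARC-AGI task is roughly a handful of input/output 2D grids of small integers, and the model has to infer the transformation from a few "train" pairs and apply it to a "test" input. A toy, hand-rolled Python sketch (this specific task and its "swap two colors" rule are made up for illustration, not taken from the benchmark):

    # Toy ARC-style task: grids are 2D lists of ints 0-9 (colors).
    # The hidden rule in this made-up task is "swap colors 1 and 2".
    toy_task = {
        "train": [
            {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
            {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
        ],
        "test": [{"input": [[0, 1], [2, 1]]}],
    }

    def apply_rule(grid, mapping):
        # Apply a per-color mapping to every cell of the grid.
        return [[mapping.get(c, c) for c in row] for row in grid]

    # A solver has to discover this mapping from the train pairs alone.
    mapping = {1: 2, 2: 1}
    print(apply_rule(toy_task["test"][0]["input"], mapping))
    # -> [[0, 2], [1, 2]]

Inducing that rule from two examples and nothing else is exactly the kind of de novo reasoning that web-scale pretraining doesn't hand you for free.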

loudmax 3 hours ago | parent | prev [-]

If you mean that they're benchmaxxing these models, then that's disappointing. At the least, it indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven extremely challenging.

If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
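
For what it's worth, "training on frontier models" usually means sequence-level distillation: prompt the stronger model, collect its answers, and fine-tune the smaller one on those transcripts. A minimal sketch of the data-collection half (teacher_generate is a placeholder, not any particular lab's API):

    import json

    def teacher_generate(prompt: str) -> str:
        # Placeholder: in practice this would call the teacher model's API
        # (OpenAI, Anthropic, a local frontier checkpoint, etc.).
        return "teacher answer for: " + prompt

    prompts = [
        "Explain what a B-tree is in two sentences.",
        "Write a Python function that reverses a linked list.",
    ]

    # Collect teacher transcripts as chat-style JSONL, the kind of file
    # an ordinary supervised fine-tuning script consumes.
    with open("distill.jsonl", "w") as f:
        for p in prompts:
            record = {"messages": [
                {"role": "user", "content": p},
                {"role": "assistant", "content": teacher_generate(p)},
            ]}
            f.write(json.dumps(record) + "\n")

The student is then fine-tuned on that file with an ordinary SFT run; nothing about the recipe is specific to any one lab, which is part of why the IP argument cuts both ways.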

Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.

Aurornis 2 hours ago | parent | next [-]

> If you mean that they're benchmaxxing these models, then that's disappointing

Benchmaxxing is the norm in open weight models. It has been like this for a year or more.

I’ve tried multiple models that are supposedly Sonnet 4.5-level, and none of them come close when you start doing serious work. They can all handle the usual Flappy Bird and TODO-list problems well, but once you get into real work it’s mostly going in circles.

Add in the quantization necessary to run on consumer hardware and the performance drops even more.
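
To make the quantization point concrete, here's a minimal numpy sketch of naive symmetric 4-bit quantization of a weight matrix. Real local runtimes use more careful schemes (per-group scales, k-quants, and so on), but the rounding error it illustrates is the same basic effect:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)  # pretend weight matrix

    # Naive symmetric 4-bit quantization: one scale for the whole tensor,
    # values rounded to integers in [-8, 7].
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7)
    w_hat = q * scale  # dequantized weights actually used at inference

    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"relative reconstruction error: {rel_err:.3%}")

Finer-grained scales shrink that error but don't eliminate it, which is where the extra quality drop on consumer hardware comes from.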

WarmWash 2 hours ago | parent | prev [-]

Anyone who has spent any appreciable amount of time playing online games with players in China, or who has dealt with Amazon review shenanigans, is well aware that China doesn't culturally view cheating-to-get-ahead the same way the West does.