esafak 4 hours ago

Has anyone noticed that models are dropping ever faster, with companies under pressure to ship incremental releases to claim pole position, yet still making strides on benchmarks? This is what recursive self-improvement with human support looks like.

emp17344 4 hours ago | parent | next [-]

Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn't even get 5% on ARC 2? Not convinced these benchmark improvements aren't data leakage.

casey2 25 minutes ago | parent [-]

ARC 2 was made specifically to lower contemporary LLM scores artificially, so any model improvement will have an outsized effect on it.

Also, people use "saturated" too liberally. Only the top-left corner of the cost/score chart, around 1 cent per task, is saturated IMO, since there are billions of people who would prefer to solve ARC 1 tasks at 52 cents per task. On ARC 2, a human working at 99.99% accuracy could make thousands of dollars a day.
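
A back-of-the-envelope with made-up numbers, just to show the scale of the claim (the ARC 2 per-task price and the daily throughput are assumptions, not ARC Prize figures):

    # Hypothetical numbers purely for illustration -- not actual ARC Prize pricing.
    arc1_human_rate = 0.52   # the "52 cents per task" figure above
    arc2_model_cost = 8.00   # assumed frontier-model cost per ARC 2 task
    tasks_per_day   = 300    # assumed tasks a focused human could solve in a day

    print(f"ARC 1 at human prices: ${arc1_human_rate * tasks_per_day:,.0f}/day")  # ~$156
    print(f"ARC 2 at model prices: ${arc2_model_cost * tasks_per_day:,.0f}/day")  # ~$2,400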

alisonkisk 4 minutes ago | parent [-]

You're saying something interesting, but it's too esoteric. Can you explain it for beginners?

redox99 4 hours ago | parent | prev | next [-]

I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens, which takes many months. Now that RL is used so heavily, you can make endless tweaks to the RL setup and, in just a month, get a better model from the same base model.

B) There's more compute online

C) Competition is more fierce.

m_ke 3 hours ago | parent | prev | next [-]

This is mostly because RLVR is driving all of the recent gains, and you can keep improving the model by running it longer (plus adding new tasks/verifiers).

So we'll keep seeing frequent flag-planting checkpoint releases so that no one can claim SOTA for too long.
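
For anyone unfamiliar with RLVR (reinforcement learning with verifiable rewards): the reward comes from a programmatic checker rather than a learned reward model, so adding a new task family mostly means writing another verifier. A rough Python sketch, with a made-up task format and toy verifiers (not any lab's actual setup):

    # Reward is 1.0 only if a verifier accepts the model's output.
    def verify_math(task, output):
        # exact-match check against a known numeric answer
        return output.strip() == task["answer"]

    def verify_sort(task, output):
        # property check: output must be the input list, sorted
        return output == sorted(task["numbers"])

    VERIFIERS = {"math": verify_math, "sort": verify_sort}

    def reward(task, output):
        return 1.0 if VERIFIERS[task["kind"]](task, output) else 0.0

    # Schematically: sample outputs from the current policy, score them with
    # reward(), update the policy (e.g., PPO/GRPO), repeat. "Adding tasks and
    # verifiers" just grows VERIFIERS, which is why checkpoints can keep
    # improving on the same base model.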

ankit219 3 hours ago | parent | prev | next [-]

Not much to do with self-improvement as such. OpenAI has increased its pace; the others are pretty consistent. Google last year shipped three versions of gemini-2.5-pro, each within a month of the previous one. Anthropic released Claude 3 in March '24, Sonnet 3.5 in June '24, 3.5 (new) in October '24, 3.7 in February '25, then moved to the 4 series in May '25, followed by Opus 4.1 in August, Sonnet 4.5 in October, Opus 4.5 in November, 4.6 in February, and Sonnet 4.6 in February as well. Yes, those last two came within weeks of each other, but previously they would have been released together. This staggered release schedule is what creates the impression of fast releases. It's as much a function of training as of available compute, and they have ramped up in that regard.

oliveiracwb 3 hours ago | parent | prev | next [-]

With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from tuning the router on good and bad outcomes, with only marginal gains in real intelligence.
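
For context, the router in an MoE layer is a small learned gate that decides which experts process each token, so "router tuning" means adjusting that gate (or its training signal) rather than the experts themselves. A toy top-k router in Python, just to show the moving parts (shapes, k, and weights are arbitrary):

    # Toy top-k MoE router -- illustrative only, not any production architecture.
    import numpy as np

    def route(token_vec, router_weights, k=2):
        """Pick the top-k experts for one token and return (indices, gate weights)."""
        logits = router_weights @ token_vec   # one logit per expert
        top_k = np.argsort(logits)[-k:]       # indices of the k highest-scoring experts
        gates = np.exp(logits[top_k])
        gates /= gates.sum()                  # softmax over the chosen experts only
        return top_k, gates

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 16))              # router weights: 8 experts, 16-dim tokens
    x = rng.normal(size=16)                   # one token's hidden state
    experts, gates = route(x, W)
    # The layer's output is sum(gates[i] * expert_i(x)) over the chosen experts,
    # so tweaking W changes which experts fire and how their outputs are mixed.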

nikcub 3 hours ago | parent | prev | next [-]

And has anyone noticed that the pace has broken xAI, and they've just been dropped behind? The frontier-improvement release loop is now Anthropic -> OpenAI -> Google.

gavinray 2 hours ago | parent | next [-]

xAI just released the Grok 4.20 beta yesterday or the day before?

dist-epoch 2 hours ago | parent | prev [-]

Musk said Grok 5 is currently being trained and has 7 trillion params (Grok 4 had 3 trillion).

svara an hour ago | parent [-]

My understanding is that all recent gains are from post-training, and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

dist-epoch 31 minutes ago | parent [-]

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

gmerc 3 hours ago | parent | prev | next [-]

That's what scaling compute to respond to the competition looks like: lighting those dollars on fire.

toephu2 2 hours ago | parent | prev | next [-]

This is what competition looks like.

PlatoIsADisease 4 hours ago | parent | prev | next [-]

Going only on my historical experience, not on Gemini 3.1 Pro itself: I think we see benchmark chasing, then a grand release of a model that gets press attention...

Then, a few days later, the model/settings are degraded to save money, and this repeats until the last day before the next model's release.

If we're benchmaxing, this works well because the model is only being tested heavily early in its life cycle. By the middle of the cycle, people are testing other models; by the end, people aren't testing it at all, and even if they did, it would barely shake the last months of data.
