esafak 4 hours ago
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim pole position, yet they keep making strides on benchmarks? This is what recursive self-improvement with human support looks like.
emp17344 4 hours ago
Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of a sudden the same models that were doing well on ARC 1 couldn't even get 5% on ARC 2? I'm not convinced these benchmark improvements aren't data leakage.

redox99 4 hours ago
I don't think there's much recursive improvement yet. I'd say it's a combination of: A) Before, a new model release mostly meant a new base model trained from scratch, with more parameters and more tokens, which takes many months. Now that RL is used so heavily, you can make endless tweaks to the RL setup and get a better model out of the same base model in just a month. B) There's more compute online. C) Competition is fiercer.

m_ke 3 hours ago
This is mostly because RLVR is driving all of the recent gains, and you can keep improving the model by running it longer (plus adding new tasks/verifiers), so we'll keep seeing frequent flag-planting checkpoint releases so that nobody gets to claim SOTA for too long.
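To make "RLVR" concrete: the idea is to reinforce whichever samples a programmatic verifier accepts, so no per-example human labels are needed. Below is a minimal toy sketch of that loop; the "policy" (a weight vector over two fixed strategies), the verifier, and the REINFORCE-with-baseline update are all illustrative stand-ins, not any lab's actual setup.

    # Toy sketch of an RLVR-style loop: sample, verify, upweight what verifies.
    # All names and the update rule are illustrative stand-ins.
    import random

    def verifier(problem, answer):
        # Verifiable reward: check the answer programmatically, no human grader.
        return 1.0 if answer == problem["a"] + problem["b"] else 0.0

    def rlvr_step(weights, strategies, problem, lr=0.5):
        # Score each sampled "completion" with the verifier, then upweight
        # the ones that beat the group-average reward.
        rewards = [verifier(problem, s(problem)) for s in strategies]
        baseline = sum(rewards) / len(rewards)
        weights = [w * (1.0 + lr * (r - baseline)) for w, r in zip(weights, rewards)]
        total = sum(weights)
        return [w / total for w in weights]

    # Two hypothetical strategies the base model might emit: one correct, one buggy.
    strategies = [lambda p: p["a"] + p["b"], lambda p: p["a"] - p["b"]]
    weights = [0.5, 0.5]
    for _ in range(50):
        problem = {"a": random.randint(1, 9), "b": random.randint(1, 9)}
        weights = rlvr_step(weights, strategies, problem)
    print(weights)  # probability mass drifts toward the strategy the verifier rewards

The relevant point for release cadence is that this outer loop can be rerun, or extended with new verifiers, far more cheaply than pretraining a new base model.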
ankit219 3 hours ago
Not much to do with self-improvement as such. OpenAI has increased its pace; the others are pretty consistent. Google last year shipped three versions of gemini-2.5-pro, each within a month of the last. Anthropic released Claude 3 in March '24, Sonnet 3.5 in June '24, 3.5 (new) in October '24, 3.7 in February '25, the 4 series in May '25, then Opus 4.1 in August, Sonnet 4.5 in October, Opus 4.5 in November, and Opus 4.6 and Sonnet 4.6 both in February. Yes, those last two came within weeks of each other, but they used to ship Opus and Sonnet together; the staggered releases are what create the impression of a faster cadence. It's as much a function of available compute as of training, and the labs have ramped up considerably in that regard.

oliveiracwb 3 hours ago
With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most of the progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence.
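For anyone unfamiliar with the router being referred to: in a mixture-of-experts layer, a small gating network scores the experts per token and only the top-k actually run, so tuning that gate changes which experts fire without touching the experts themselves. A minimal, purely illustrative top-k routing sketch (toy scalar "experts" and hand-picked weights; real routers are learned linear layers over high-dimensional hidden states):

    # Minimal top-k MoE routing sketch. Experts and router weights are toy stand-ins.
    import math

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    def moe_layer(token, router_weights, experts, top_k=2):
        # The router scores every expert for this token...
        scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
        gates = softmax(scores)
        # ...but only the top-k experts actually run, which is the efficiency win.
        picked = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
        norm = sum(gates[i] for i in picked)
        return sum((gates[i] / norm) * experts[i](token) for i in picked)

    # Hypothetical "experts": each maps the token to a scalar in a different way.
    experts = [sum, max, min, lambda t: sum(t) / len(t)]
    router_weights = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3], [0.5, 0.1]]
    print(moe_layer([1.0, 2.0], router_weights, experts))

Tuning only router_weights reshuffles which experts get the traffic and the gradient, which is the kind of change the comment above suggests may account for much of the headline gains.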
nikcub 3 hours ago
And has anyone noticed that the pace has broken xAI, and they've simply been dropped behind? The frontier-improvement release loop is now Anthropic -> OpenAI -> Google.

gmerc 3 hours ago
That's what scaling compute depth to respond to the competition looks like: lighting those dollars on fire.

toephu2 2 hours ago
This is what competition looks like.

PlatoIsADisease 4 hours ago
Speaking only from historical experience, not from Gemini 3.1 Pro itself: I think we see benchmark chasing, then a grand release of a model that gets press attention... then, a few days later, the model/settings are degraded to save money, and this repeats until the last day before the next model's release. If labs are benchmaxing, this works well, because the model is only really tested early in its life cycle. By the middle of the cycle people are testing other models; by the end hardly anyone is testing it, and if they did it would barely move the last months of data.