ausbah 7 days ago

Those are really good points, but LLMs have really started to plateau in their capabilities, haven't they? The improvement from GPT-2 class models to GPT-3 was much bigger than from 3 to 4, which was only somewhat bigger than from 4 to 5.

Most of the vibe shift I think I've seen in the past few months toward using LLMs for coding has come from improvements in dataset curation and UX, not fundamentally better tech.

worldsayshi 7 days ago | parent | next [-]

> LLMs have really started to plateau

That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps. When a fruitful approach is discovered, we run with it until diminishing returns set in. Often enough, a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do, we get a new sigmoid-like leap. Etc...
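Roughly, the picture I have in mind (just a sketch, nothing rigorous): each fruitful approach contributes something like a logistic curve, and overall progress looks like a sum of shifted logistics:

    f_i(t) = \frac{L_i}{1 + e^{-k_i (t - t_i)}}, \qquad \text{progress}(t) \approx \sum_i f_i(t)

Each term saturates on its own, but a new term (a new approach) keeps the total climbing.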

worldsayshi 7 days ago | parent [-]

Personally, my bet for the next fruitful step is something in line with what Victor Taelin [1] is trying to achieve.

I.e. combining new approaches around old-school "AI" with GenAI. That's probably not exactly what he's trying to do, but maybe somewhere in the ballpark.

1 - https://x.com/victortaelin

bigstrat2003 7 days ago | parent | prev | next [-]

Started? In my opinion they haven't gotten better since the release of ChatGPT a few years ago. The weaknesses are still just as bad, and the strengths have not improved. Which is why I disagree with the hype saying they'll get better still. They don't do the things they are claimed to do today, and they haven't gotten better in the last few years. Why would I believe that they'll achieve even higher goals in the future?

Closi 6 days ago | parent [-]

I assume you don't use these models frequently, because there is a staggering difference in response quality between frontier LLMs and GPT-3.

Go open the OpenAI API playground, give GPT-3 and GPT-5 the same prompt to build a reasonably basic game in JavaScript to your specification, and watch GPT-3 struggle while GPT-5 one-shots it.
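If you'd rather do it from a script than the playground, something like this is enough (a rough sketch using the OpenAI Python SDK; the model names are placeholders, since the original GPT-3 completion models may no longer be available on your account):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = "Write a minimal Breakout clone in plain JavaScript using a <canvas> element."

    # Placeholder pairing: an older chat model standing in for "GPT-3" vs. a current frontier model.
    for model in ["gpt-3.5-turbo", "gpt-5"]:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
        )
        print(f"===== {model} =====")
        print(response.choices[0].message.content)

Diff the two outputs and the gap is hard to miss.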

globular-toast 6 days ago | parent | next [-]

Sure, but it's kinda like a road that never quite gets you anywhere. It seems to get closer and closer to the next town all the time, but ultimately it's still not there yet, and that's all that really matters.

chrz 6 days ago | parent | prev [-]

They're faster, shinier, and get lost less, but they still don't fly.

DanielHB 7 days ago | parent | prev | next [-]

All the other things he mentioned didn't rely on breakthroughs. LLMs really do seem to have reached a plateau and need a breakthrough to push them along to the next step.

Thing is, breakthroughs are always X years away (50 years for fusion power, for example).

The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it's not like cell phones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced.

Maybe broadband internet access also qualifies.

Closi 6 days ago | parent [-]

> All the other things he mentioned didn't rely on breakthroughs. LLMs really do seem to have reached a plateau and need a breakthrough to push them along to the next step.

I think a lot of them relied on gradual improvement and lots of 'mini-breakthroughs' rather than one single breakthrough that changed everything. These mini-breakthroughs also took decades to play out properly in almost every example on the list, not just a couple of years.

My personal gut feeling is that even if the core technology plateaus, there's still lots of iterative improvement to go after in productising and commercialising the existing technology (e.g. improving tooling and UI, applying it to real problems, productising current research, etc.).

In electric car terms, we are still at the stage where Tesla is shoving batteries into a Lotus Elise rather than releasing the Model 3. We might have the lithium polymer batteries, but there's still lots of work to do to pull them into the final product.

(Having said this, I don't think the technology has plateaued; I think we are just looking at it across a very narrow time span. If in 1979 you had said that computers had plateaued because there hadn't been much progress in the previous 12 months, you would have been very wrong. Breakthroughs sometimes take longer as a technology matures, but that doesn't mean that the technology two decades from now won't be substantially different.)

imtringued 6 days ago | parent | prev | next [-]

There is also an absolutely massive gap between Llama 2 and Llama 3. The Llama 3.1 models represent the beginning of usable open-weight models. Meanwhile, Llama 4 and its competitors seem to be incremental improvements.

Yes, the newest models are so much better that they obsolete the old ones, but now the biggest differences between models are primarily what they know (parameter count and dataset quality) and how much they spend thinking (compute budget).

stpedgwdgfhgdd 7 days ago | parent | prev | next [-]

There is a big difference between Claude Code today and six months ago. Perhaps the LLMs have plateaued, but the tooling has not.

NitpickLawyer 7 days ago | parent | prev | next [-]

> but LLMs have really started to plateau off on their capabilities haven’t they?

Uhhh, no?

In the past month we've had:

- LLMs (3 different models) getting gold at the IMO

- gold at the IOI

- beating 9/10 human developers at the AtCoder heuristic contest (optimisation problems), with the single human who actually beat the machine saying he was exhausted and that next year it'll probably be over

- agentic coding that actually works, and works for 30-90 minute sessions while staying coherent and actually finishing tasks

- a 4-6x reduction in price for top-tier (SotA?) models. OpenAI's "best" model now costs $10/MTok while retaining 90+% of the performance of their previous SotA models, which were $40-60/MTok

- several "harnesses" being released by every model provider. Claude Code seems to remain the best, but alternatives are popping up everywhere: geminicli, opencoder, qwencli (forked, but still), etc.

- open-source models that are getting close to SotA again, only 6-12 months behind (depending on who you ask) and cheap to run (~$2/MTok on some providers)

I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where the number can only go up so far before it becomes meaningless. In my opinion, regular benchmarks have become useless: MMLU & co are cute, but agentic performance is what matters. And those capabilities have only improved, and will continue to improve with better data, better signals, and better training recipes.

Why do you think every model provider is heavily subsidising coding right now? They all want that sweet, sweet data and those signals, so they can improve their models.

tripzilch 3 days ago | parent [-]

> I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks

Don't you mean the opposite? Like, it got gold at the IMO, which is a benchmark, but it's nowhere remotely close to having even the basic mathematical capabilities someone who got gold at the IMO could be expected to have.

Like being unable to deal with negations... or getting confused by a question stated in something other than its native alphabet...

cameronh90 7 days ago | parent | prev [-]

I'm not sure I'd describe it as a plateau. It might be, but I'm not convinced. Improvements are definitely not as immediately obvious now, but how much of that is due to it being more difficult to accurately gauge intelligence above a certain point? Or even that the marginal real life utility of intelligence _itself_ starts to plateau?

A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious, but the improvement in going from IQ 70 to Einstein is much harder to assess and arguably not that useful for most tasks.

I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.