ausbah | 7 days ago
those are really good points, but LLMs have really started to plateau in their capabilities, haven't they? the improvements from gpt2-class models to 3 were much bigger than 3 to 4, which was only somewhat bigger than 4 to 5. most of the vibe shift I think I've seen in the past few months to using LLMs in the context of coding has been improvements in dataset curation and ux, not fundamentally better tech
worldsayshi | 7 days ago
> LLMs have really started to plateau

That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps: when a fruitful approach is discovered, we run with it until diminishing returns set in. Often enough, a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do, we get a new sigmoid-like leap. Etc...
bigstrat2003 | 7 days ago
Started? In my opinion they haven't gotten better since the release of ChatGPT a few years ago. The weaknesses are still just as bad, and the strengths have not improved. Which is why I disagree with the hype saying they'll get better still. They don't do the things they're claimed to do today, and they haven't gotten better in the last few years. Why would I believe that they'll achieve even higher goals in the future?
DanielHB | 7 days ago
All the other things he mentioned didn't rely on breakthroughs; LLMs really do seem to have reached a plateau and need a breakthrough to push along to the next step. Thing is, breakthroughs are always X years away (50 for fusion power, for example).

The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it's not like cellphones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced. Maybe broadband internet also qualifies.
imtringued | 6 days ago
There is also an absolutely massive gap between Llama 2 and Llama 3. The Llama 3.1 models represent the beginning of usable open-weight models. Meanwhile, Llama 4 and its competitors seem to be incremental improvements. Yes, the newest models are so much better that they obsolete the old ones, but now the biggest differences between models are primarily what they know (parameter count and dataset quality) and how much they spend thinking (compute budget).
stpedgwdgfhgdd | 7 days ago
There is a big difference between Claude Code today and six months ago. Perhaps the LLMs have plateaued, but the tooling has not.
NitpickLawyer | 7 days ago
> but LLMs have really started to plateau in their capabilities, haven't they?

Uhhh, no? In the past month we've had:

- LLMs (3 different models) getting gold at IMO
- gold at IOI
- beating 9/10 human developers at AtCoder heuristics (optimisation problems), with the single human who actually beat the machine saying he was exhausted and next year it'll probably be over
- agentic coding that actually works, and works for 30-90 minute sessions while staying coherent and actually finishing tasks
- a 4-6x reduction in price for top-tier (SotA?) models. oAI's "best" model now costs $10/MTok while retaining 90+% of the performance of their previous SotA models that were $40-60/MTok
- several "harnesses" being released by every model provider. Claude Code seems to remain the best, but alternatives are popping up everywhere: geminicli, opencoder, qwencli (forked, but still), etc.
- opensource models that are getting close to SotA, again. Being 6-12 months behind (depending on who you ask), opensource, and cheap to run (~$2/MTok on some providers)

I don't see the plateauing in capabilities. LLMs are plateauing only in benchmarks, where "number goes up" can only go up so far before it becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And they will continue to improve, with better data, better signals, better training recipes. Why do you think every model provider is heavily subsidising coding right now? They all want that sweet sweet data & signals, so they can improve their models.
cameronh90 | 7 days ago
I'm not sure I'd describe it as a plateau. It might be, but I'm not convinced. Improvements are definitely not as immediately obvious now, but how much of that is due to it being more difficult to accurately gauge intelligence above a certain point? Or even that the marginal real-life utility of intelligence _itself_ starts to plateau? A (bad) analogy: I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious, but the improvement when going from IQ 70 to Einstein is much harder to assess and arguably not that useful for most tasks.

I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try the older model again and be quite surprised at how much worse it is.