aubanel 3 days ago

The premise "LLMs have reached a plateau" is false IMO.

Here are the metrics by which the author defines this plateau: "limited by their inability to maintain coherent context across sessions, their lack of persistent memory, and their stochastic nature that makes them unreliable for complex multi-step reasoning."

If you benchmark any proxy for the points above, for instance "can models solve problems that require multiple steps in agentic mode" (PlanBench, BrowseComp, and I've even built custom benchmarks), the progress from one model generation to the next is very clear and shows no sign of slowing down.
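To make "custom benchmark" concrete, here's a minimal sketch of what one can look like. Everything in it is illustrative, not a real benchmark: the task, the pass check, and the model names are placeholders, and it assumes the official OpenAI Python SDK with an API key in the environment.

```python
# Minimal multi-step benchmark sketch. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment; the task and checker are placeholders
# (real suites like PlanBench validate full plans, not just final answers).
from openai import OpenAI

client = OpenAI()

# Each task should require several dependent reasoning steps to solve.
TASKS = [
    {
        "prompt": "You have jugs of 3L and 5L. Describe the exact pour "
                  "sequence to measure 4L, then end with 'ANSWER: <liters>'.",
        "check": lambda out: "ANSWER: 4" in out,
    },
    # ... more multi-step tasks ...
]

def score(model: str) -> float:
    """Fraction of multi-step tasks the model solves in a single shot."""
    passed = 0
    for task in TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task["prompt"]}],
        )
        if task["check"](resp.choices[0].message.content or ""):
            passed += 1
    return passed / len(TASKS)

for model in ["gpt-4o", "gpt-5"]:  # older vs. newer model, names illustrative
    print(model, score(model))
```

Even a crude harness like this makes generation-over-generation gaps on multi-step tasks visible.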

And this converts to real-world tasks: yesterday, I had GPT-5 build complex React charts in one shot, whereas previous models required constant supervision.

I think we're moving the goalposts too fast for LLMs, and that's what can lead us to believe they've plateaued. Just try using past models for your current tasks (use open models to be sure they weren't updated) and watch them struggle.
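One way to run that comparison reproducibly: point the same harness at a locally served open model, whose weights are frozen, so you know the baseline can't have been silently updated. The endpoint and model name below are assumptions; any OpenAI-compatible server (vLLM, for example) would work the same way.

```python
# Sketch: reuse the harness above against a frozen open model served locally.
# Assumes an OpenAI-compatible server (e.g. vLLM) on localhost; the base URL
# and model name are illustrative.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = local.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Build a React bar chart component "
               "with stacked series and a custom tooltip."}],
)
print(resp.choices[0].message.content)
```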