NitpickLawyer 3 days ago

I've been hearing this a lot, but I kinda disagree. They aren't plateauing IMO, they are getting better at new things, and that enables new capabilities. This doesn't often show in the traditional benchmarks (which are becoming less and less useful indicators of capabilities).

Take gemini 2.5 for example. It has an enormous useful context window. There were long-context gimmicks before, but usefulness dropped like a stone past 30-40k tokens. Now models stay coherent at 100k+ tokens and do useful tasks at those lengths.

The agentic stuff is also really getting better. 4.1-nano can now do stuff that sonnet 3.5 + a lot of glue couldn't do a year ago. That's amazing, imo. We even see that with open models. Devstral has been really impressive for its size, and I hear good things about the qwen models, tho I haven't yet tried them.

There's also evidence that the models themselves are getting better at raw agentic stuff (i.e. they generalise). The group that released swe-agent recently released mini-swe-agent, a ~100 LoC harness that runs Claude 4 in a loop with no tools other than "terminal". And it still gets to within 10% of their much larger, tool-supporting swe-agent harness on swe-bench.
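To make the "model in a loop with only a terminal" idea concrete, here's a minimal sketch of that kind of harness. This is not mini-swe-agent's actual code; `ask_model` is a hypothetical stand-in for a real LLM API call, and the fenced-command convention is an assumption for illustration.

```python
import re
import subprocess

def run_agent(ask_model, task, max_steps=10):
    """Loop: show the model the history, run the shell command it
    emits, feed the output back. The terminal is the only 'tool'.
    `ask_model` is a placeholder for a real LLM call."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = ask_model("\n".join(history))
        if "DONE" in reply:
            break
        # Convention (assumed): the model puts its command in a ```bash block.
        m = re.search(r"```bash\n(.*?)```", reply, re.DOTALL)
        if not m:
            history.append("No command found; emit a ```bash block or DONE.")
            continue
        cmd = m.group(1).strip()
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        history.append(f"$ {cmd}\n{result.stdout}{result.stderr}")
    return history
```

The point of the real result is that this near-trivial loop, with a strong enough model, recovers most of the performance of a far more elaborate tool-calling harness.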

I don't see the models plateauing. I think our expectations are overinflated.