NitpickLawyer a day ago

> and that LLMs have basically reached a plateau

This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM-based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name". It was able to systematically work the problem: search, refine the search, and infer the correct name from an e-mail that mentioned nothing more than "X's favourite foods" with a link to a YouTube video. Come on!
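For the skeptical, the loop such an agent runs is simple to sketch. Below is a minimal illustration with the LLM replaced by a hard-coded stub policy; the `search_emails` tool, the sample inbox, the name "Mia", and the stub's decision rules are all invented for illustration, not taken from the actual demo:

```python
# Minimal sketch of an agentic tool-use loop. The "model" here is a stub
# that mimics the observed behaviour: search, refine, then infer the name.

EMAILS = [
    {"from": "brother@example.com", "subject": "weekend plans",
     "body": "See you Saturday!"},
    {"from": "brother@example.com", "subject": "Mia's favourite foods",
     "body": "She loves pasta - here's a video: https://youtube.example/abc"},
]

def search_emails(query: str) -> list[dict]:
    """Tool: return emails whose subject or body contains the query."""
    q = query.lower()
    return [e for e in EMAILS
            if q in e["subject"].lower() or q in e["body"].lower()]

def stub_model(task: str, observations: list) -> dict:
    """Stand-in for the LLM: choose the next action given prior tool results.
    A real agent would prompt a model with the task and the observations."""
    if not observations:
        return {"action": "search", "args": "kid"}        # start with a broad query
    if not observations[-1]:
        return {"action": "search", "args": "favourite"}  # no hits: refine the query
    # Infer the name from a subject like "Mia's favourite foods".
    subject = observations[-1][0]["subject"]
    return {"action": "answer", "args": subject.split("'")[0]}

def run_agent(task: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):
        step = stub_model(task, observations)
        if step["action"] == "answer":
            return step["args"]
        observations.append(search_emails(step["args"]))
    return "gave up"

print(run_agent("find my brother's kid's name"))  # -> Mia
```

The point of the demo is that the model supplies the decision logic the stub hard-codes here: no human told it which queries to try or how to read the result.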

That's not to mention things like AlphaEvolve, Microsoft's agentic testing demo with Copilot running a browser, exploring functionality, and writing Playwright tests, and all the advances in coding.

sensanaty a day ago | parent | next [-]

And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where one kept insisting it had fixed the failing tests it had written without actually fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep them under the rug on some of the simplest code imaginable.

But sure, it managed to find a name buried in some emails after being told to... Search through emails. Wow. Such magic

[1] https://news.ycombinator.com/item?id=44050152 [2] https://news.ycombinator.com/item?id=44056530

hsn915 a day ago | parent | prev | next [-]

Is this something that the models from 4 months ago were not able to do?

vessenes 9 hours ago | parent [-]

For a fair definition of "able", yes. Those models had no ability to search emails at all.

What’s special about it is that it required no handholding; that is new.

camdenreslink 9 hours ago | parent [-]

Is this because the models improved, or because the tooling around the models improved (both the parts visible to the end user and the parts that aren't)?

My impression is that the base models have not improved dramatically in the last 6 months, and incremental improvements to those models are becoming extremely expensive.

morepedantic a day ago | parent | prev [-]

The LLMs have reached a plateau. Successive generations will be marginally better.

We're watching innovation move into the use and application of LLMs.

the8472 9 hours ago | parent [-]

Innovation and better application of a relatively fixed amount of intelligence got us from wood spears to the moon.

So even if the plateau is real (which I doubt, given the pace of new releases and things like AlphaEvolve) and we'd only expect small fundamental improvements, some "better applications" could still mean a lot of untapped potential.