| dcre 3 days ago |
| The first premise of the argument is that LLMs are plateauing in capability and this is obvious from using them. It is not obvious to me. |
|
| gwern 3 days ago |
| It is especially not obvious because this was written using ChatGPT-5. One appreciates the (deliberate?) irony, at least. (Or at least, surely if they had asymptoted, OP should've been able to write this upvoted HN article with an old GPT-4, say...) |
| mdp2021 3 days ago |
> this was written using
How do you know?
| gwern 3 days ago |
It is lacking in URLs or references. (The systematic error in the self-reference blog post URLs is also suspicious: an outdated system prompt? If nothing else, it shows the human involved is sloppy, since every link is broken.) The assertions are broadly clichés and truisms, and the solutions are trendy buzzwords from a year ago or more (consistent with knowledge cutoffs and an emphasis on mainstream sources/opinions). The tricolon and unordered bolded triplet lists are ChatGPT. The em dashes (which you should not need to be told about at this point) and the it's-not-x-but-y formulation are extremely blatant, if not 4o-level, while emoji and hyperbolic language are absent; hence, it's probably GPT-5. (Sub-GPT-5 ChatGPTs would also generally balk at talking about a 'GPT-5' because they think it doesn't exist yet.) I don't know if it was 100% GPT-5-written, but I do note that when I try the intro thesis paragraph on GPT-5-Pro, it dislikes it and identifies several stupid assertions (e.g. the claim that power-law scaling has now hit 'diminishing returns', which is meaningless because all log or power laws always have diminishing returns), so it was probably not completely GPT-5-written (or, at least, not written at the Pro level).
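(Illustrative aside: a minimal Python sketch of how the surface-level tells listed above could be counted mechanically. The regexes, thresholds, and sample text are assumptions added for demonstration, not anything the commenter describes actually using.)

    import re

    # Crude counters for the surface tells mentioned above: em dashes,
    # the "isn't just X, it's Y" framing, and bolded bullet-point triplets.
    # Patterns are illustrative only, not a real detector.
    def count_tells(text: str) -> dict:
        return {
            "em_dashes": text.count("\u2014"),
            "not_x_but_y": len(re.findall(r"isn[\u2019']t just [^.\n]*\u2014", text, re.IGNORECASE)),
            "bold_bullets": len(re.findall(r"^\s*[-*]\s+\*\*[^*]+\*\*", text, re.MULTILINE)),
        }

    if __name__ == "__main__":
        sample = "The gap isn\u2019t just quantitative\u2014it\u2019s qualitative."
        print(count_tells(sample))  # {'em_dashes': 1, 'not_x_but_y': 1, 'bold_bullets': 0}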
| mdp2021 3 days ago |
> when I try the intro thesis paragraph on GPT-5-Pro, it dislikes it
I don't know about GPT-5-Pro, but LLMs can dislike their own output (when they work well...).
| gwern 3 days ago |
They can, but they are known to have a self-favoring bias, and in this case the error is so easily identified that it raises the question of why GPT-5 would both come up with it and preserve it when it can so easily identify it; whereas if it was part of OP's original inputs (whatever those were), it is much less surprising (because it is a common human error, mindlessly parroted in a lot of the 'scaling has hit a wall' human journalism).
| Foreignborn 3 days ago |
Do you have a source? When I've done toy demos where GPT-5, Sonnet 4, and Gemini 2.5 Pro critique/vote on various docs (e.g. PRDs), they did not choose their own material more often than not. My setup wasn't intended as a benchmark, though, so this could be wrong over enough iterations.
| gwern 3 days ago |
I don't have any particularly canonical reference I'd cite here, but self-preference bias in LLMs is well-established. (Just search Arxiv.)
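(Illustrative aside: a minimal Python sketch of the cross-model critique/vote setup described two comments up, as one way to probe self-preference bias. `query_model`, the model names, and the prompts are placeholder assumptions; wire in real provider API clients to use it.)

    from collections import Counter

    # Stub so the sketch runs end to end; replace with a real provider SDK call.
    def query_model(model: str, prompt: str) -> str:
        return "[A]"

    MODELS = ["gpt-5", "claude-sonnet-4", "gemini-2.5-pro"]  # illustrative names

    def self_preference_round(task: str) -> Counter:
        # 1. Each model drafts the same document (e.g. a PRD).
        drafts = {m: query_model(m, f"Write a one-page PRD for: {task}") for m in MODELS}

        # 2. Each model judges all drafts, which are labeled only by a letter.
        labels = dict(zip("ABC", MODELS))  # letter -> author model
        listing = "\n\n".join(f"[{letter}]\n{drafts[author]}" for letter, author in labels.items())
        votes = Counter()
        for judge in MODELS:
            pick = query_model(
                judge,
                "Pick the strongest draft below. Reply with only its bracketed letter.\n\n" + listing,
            ).strip().strip("[]")
            votes["self" if labels.get(pick) == judge else "other"] += 1
        # Self-preference bias would show up as 'self' votes well above chance (1/3 here).
        return votes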
|
|
|
| alangou 3 days ago |
My favorite tell-tale sign:
> The gap isn’t just quantitative—it’s qualitative.
> LLMs don’t have memory—they engage in elaborate methods to fake it...
> This isn’t just database persistence—it’s building memory systems that evolve the way human memory does...
> The future isn’t one model to rule them all—it’s hundreds or thousands of specialized models working together in orchestrated workflows...
> The future of AGI is architectural, not algorithmic.
|
|
|
| netrem 3 days ago |
| Consensus on GPT-5 has been that it was underwhelming, and definitely a smaller jump than 3 to 4. |
| dcre 2 days ago |
I understand that is what a lot of people are saying. It doesn’t match my experience.
|
|
| taormina 3 days ago |
Just anecdata, but they keep releasing new versions and they keep not being better. What would you describe this as, if not plateauing? Worsening?
| danenania 3 days ago |
I see a lot of people saying things like this, and I’m not really sure which planet you all are living on. I use LLMs nearly every day, and they clearly keep getting better.
| taormina 2 days ago |
Grok hasn't gotten better. OpenAI hasn't gotten better. Claude Code with Opus and Sonnet, I swear, is getting actively worse. Maybe you only use them for toy projects, but getting them to do real work in my real codebase is an exercise in frustration. Yes, I've done meaningful prompting work, and I've set up all the CLAUDE.md files, and then it proceeds to completely ignore everything I said and all of the context I gave, and just craps out something completely useless. It has accomplished a small amount of meaningful work: exactly enough that I think I'm neutral rather than in the negative in terms of work:time compared to just doing it all myself. I get to tell myself that it's worth it because at least I'm "keeping up with the industry", but I honestly just don't get the hype train one bit.

Maybe I'm too senior? Maybe the frameworks I use, despite being completely open source and available as training data for every model on the planet, are too esoteric? And then the top post on the front page today tells me that my problem is that I'm bothering to supervise, and that I should be writing an agent framework so it can spew out the crap in record time... But I need to know what is absolute garbage and what needs to be reverted.

I will admit that my usual pattern has been to try to prompt it into better test coverage/specific feature additions/etc. on nights and weekends, and then focus my daytime working hours on reviewing what was produced. About half the time I review it and have to heavily clean it up to make it usable, but more often than not I revert the whole thing and just start on it myself from scratch. I don't see how this counts as "better".
| danenania 2 days ago |
It can definitely be difficult and frustrating to try to use LLMs in a large codebase—no disagreement there. You have to be very selective about the tasks you give them and how they are framed. And yeah, you often need to throw away what they produced when they go in the wrong direction. None of that means they’re getting worse, though. They’re getting better; they’re just not as good as you want them to be.
| taormina 2 days ago |
I mean, this really isn't a large codebase; this is a small-to-medium codebase by the standard of prior jobs/projects. 9,000 lines of code? When I give them the same task I gave them the day before, and the output is noticeably worse than the previous model version's, is that better? When the day-by-day performance feels like it's degrading? They are definitely not as good as I would like them to be, but that's to be expected from professionals who beg for money while hyping them up.
|
|
|
|