| ▲ | hsn915 a day ago |
| I can't be the only one who thinks this version is no better than the previous one, that LLMs have basically reached a plateau, and that the "features" in all the new releases are more or less just gimmicks. |
|
| ▲ | TechDebtDevin a day ago | parent | next [-] |
| I think they are just getting better at the edges: MCP/tool calls, structured output. This definitely isn't increased intelligence, but it is an increase in the value add; I'm not sure the value added equates to the training costs or company valuations, though. In all reality, I have zero clue how any of these companies remain sustainable. I've tried to host some inference on cloud GPUs and it seems like it would be extremely cost prohibitive with any sort of free plan. |
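The "edges" in question (tool calls, structured output) mostly reduce to a validate-and-dispatch loop on the host side. A minimal sketch in Python; the tool name, schema, and wire format here are invented for illustration, not any vendor's actual API:

```python
import json

# Hypothetical host-side handling of a structured tool call: the model
# emits JSON, the host validates it against a declared schema, runs the
# tool, and would feed the result back to the model.

TOOL_SCHEMA = {"name": "get_weather", "required": ["city"]}

def validate_tool_call(raw: str) -> dict:
    """Parse the model's output and check it against the tool schema."""
    call = json.loads(raw)
    if call.get("name") != TOOL_SCHEMA["name"]:
        raise ValueError(f"unknown tool: {call.get('name')}")
    missing = [k for k in TOOL_SCHEMA["required"]
               if k not in call.get("arguments", {})]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return call

def run_tool(call: dict) -> str:
    # Stand-in for a real tool implementation.
    return f"22C and sunny in {call['arguments']['city']}"

model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
call = validate_tool_call(model_output)
print(run_tool(call))  # 22C and sunny in Berlin
```

The "getting better at the edges" claim is essentially that models now emit the JSON half of this loop reliably, so the host-side validation fails far less often.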
| |
| ▲ | layoric a day ago | parent | next [-] | | > how any of these companies remain sustainable They don't; they have a big bag of money they are burning through while working to raise more. Anthropic is in a better position because they don't have the majority of the public using their free tier. But, AFAICT, none of the big players are profitable; some might get there, but likely through verticals rather than just model access. | | |
| ▲ | tymscar a day ago | parent | next [-] | | Doesn’t this mean that realistically even if “the bubble never pops”, at some point money will run dry? Or do these people just bet on the post money world of AI? | | |
| ▲ | Aeolun a day ago | parent | next [-] | | The money won’t run dry. They’ll just stop providing a free plan when the marginal benefits of having one don’t outweigh the costs any more. | | |
| ▲ | fy20 a day ago | parent | next [-] | | In two years' time you'll need to add a 10% Environmental Tax, a 25% Displaced Workers Tax, and a 50% tip to your OpenAI bills. | | | |
| ▲ | Iolaum 15 hours ago | parent | prev [-] | | It's more likely that the free tier model will be a distilled lower parameter count model that will be cheap enough to run. |
| |
| ▲ | layoric 19 hours ago | parent | prev [-] | | They will likely just charge a lot more for these services. E.g., I think the $200+ per month plans could become more of the entry level in 3-5 years. That said, smaller models are getting very good, so there could be low-margin direct model services alongside expensive verticals IMO. | | |
| ▲ | AstroBen 8 hours ago | parent [-] | | At that price it would start to be worth it to set up your own hardware and run local open source models |
|
| |
| ▲ | KennyBlanken 7 hours ago | parent | prev [-] | | If your house is on fire, the fact that the village is throwing firewood through the windows doesn't really mean the house will stay standing longer. |
| |
| ▲ | hijodelsol 17 hours ago | parent | prev | next [-] | | If you read any work from Ed Zitron [1], they likely cannot remain sustainable. With OpenAI failing to convert into a for-profit, Microsoft more interested in being a multi-model provider and competing openly with OpenAI (e.g., open-sourcing Copilot vs. Windsurf, GitHub Agent with Claude as the standard vs. Codex), Google having their own SOTA models and not relying on their stake in Anthropic, tariffs complicating Stargate, the explosion in capital expenditure and compute, etc., I would not be surprised to see OpenAI and Anthropic go under in the next few years. 1: https://www.wheresyoured.at/oai-business/ | | |
| ▲ | vessenes 9 hours ago | parent | next [-] | | I see this sentiment everywhere on Hacker News. I think it's generally the result of consuming the laziest journalism out there. But I could be wrong! Are you interested in making a long bet backing your prediction? I'm interested in taking the positive side on this. | | |
| ▲ | hijodelsol 3 hours ago | parent [-] | | While some critical journalism may be simplistic, I would not call it lazy. Much of it is deeply nuanced and detail-oriented. To me, lazy would be publications regurgitating the statements of CEOs and company PR people who have a vested interest in making their product seem appealing. Since most of the hype is based on perceived futures, benchmarks, or the automation of the easier half of code development, I consider the simplistic voices asking "Where is the money?" to be important, because most people seem to neglect the fundamental business aspects of this sector. I am someone who works professionally in ML (though not LLM development itself) and deploys multiple RAG- and MCP-powered LLM apps in side businesses. I code with Copilot, Gemini, and Claude, and I read and listen to most AI-industry output, be it company events, papers, articles, MSM reports, the Dwarkesh podcast, MLST, etc. While I acknowledge some value, having closely followed the field and extensively used LLMs, I find these companies' projections and visions deeply unconvincing and cannot identify the trillion-dollar value. While I never bet for money and don't think everything has to be transactional or competitive, I am open to defining terms and to recognizing if I'm wrong. What do you mean by taking the positive side? Do you think OpenAI's revenue projections are realistic and will be achieved or surpassed by competing in the open market (i.e., excluding purely political capture)? Betting on the survival of the legal entity would likely not be the right endpoint, because OpenAI could probably be profitable with a small team if it restricted itself to serving only GPT-4.1 mini and did not develop anything new. They could also be acquired by companies with deeper pockets that have alternative revenue streams. 
But I am highly convinced that OpenAI will not have revenue of > 100 billion by 2029 while being profitable [1], and I am willing to take my chances. 1: https://www.reuters.com/technology/artificial-intelligence/o... |
| |
| ▲ | viraptor 13 hours ago | parent | prev [-] | | There's still the question of whether they will try to change the architecture before they die. Using RWKV (or something similar) would cut costs quite a bit, but it would require risky investment. On the other hand, some are already experimenting with text diffusion, so it's slowly happening. |
| |
| ▲ | yahoozoo a day ago | parent | prev [-] | | https://www.wheresyoured.at/reality-check/ | | |
| ▲ | holoduke 14 hours ago | parent [-] | | This man (in the article) clearly hates AI. I also think he does not understand business and is not really able to predict the future. | | |
| ▲ | sameermanek 13 hours ago | parent [-] | | He did make good points, though. AI was perceived as more dangerous when only a select few mega corps (usually backing each other) were pushing its capabilities. But now every $50B+ company seems to have its own model. Chinese companies have an edge in local models, and big tech seems to be fighting each other like cats and dogs over a technology that has failed to generate any profit, while the masses drain the cash out of these companies with free usage and Ghiblis. What is the concrete business model here? Someone at Google said "we have no moat", and I guess he was right; this is becoming more and more of a commodity. | | |
| ▲ | JohnPrine 11 hours ago | parent [-] | | oil is a commodity, and yet the oil industry is massive and has multiple major players | | |
| ▲ | notfromhere 7 hours ago | parent [-] | | also was kind of a shit investment unless you figured out which handful of companies were gonna win. |
|
|
|
|
|
|
| ▲ | NitpickLawyer a day ago | parent | prev | next [-] |
| > and that LLMs have basically reached a plateau This is the new stochastic parrots meme. Just a few hours ago there was a story on the front page where an LLM-based "agent" was given 3 tools to search e-mails and the simple task "find my brother's kid's name", and it was able to systematically work the problem, search, refine the search, and infer the correct name from an e-mail mentioning nothing other than "X's favourite foods" with a link to a YouTube video. Come on! That's not to mention things like AlphaEvolve, Microsoft's agentic test demo with Copilot running a browser, exploring functionality and writing Playwright tests, and all the advances in coding. |
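For what it's worth, the agent loop being described can be caricatured in a few lines. In the toy sketch below, a scripted policy stands in for the model, and all names, addresses, and emails are invented; what it illustrates is the shape of the behavior: search, observe a miss, broaden, infer from indirect evidence.

```python
# Toy mailbox the "agent" can search. Entirely fabricated data.
EMAILS = [
    {"from": "brother@example.com", "subject": "Weekend plans",
     "body": "See you Saturday"},
    {"from": "brother@example.com", "subject": "Mia's favourite foods",
     "body": "She only eats pasta. Video: https://youtube.example/abc"},
]

def search_emails(query: str) -> list[dict]:
    """One of the agent's tools: naive substring search over the mailbox."""
    q = query.lower()
    return [e for e in EMAILS
            if q in e["subject"].lower() or q in e["body"].lower()]

# Step 1: a direct search for the answer comes up empty...
assert search_emails("kid's name") == []

# Step 2: ...so the policy broadens the search and infers the name from
# "X's favourite foods" - the indirect evidence the comment describes.
hits = search_emails("favourite")
name = hits[0]["subject"].split("'")[0]
print(name)  # Mia
```

The interesting claim in the comment is not that this loop is hard to write by hand, but that the model now drives it autonomously, deciding when to broaden the query and what counts as evidence.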
| |
| ▲ | sensanaty a day ago | parent | next [-] | | And we also have a showcase from a day ago [1] of these magical autonomous AI agents failing miserably in the PRs unleashed on the dotnet codebase, where the agent kept claiming it had fixed failing tests it wrote, without actually fixing them. Oh, and multiple blatant failures that happened live on stage [2], with the speaker trying to sweep the failures under the rug on some of the simplest code imaginable. But sure, it managed to find a name buried in some emails after being told to... search through emails. Wow. Such magic. [1] https://news.ycombinator.com/item?id=44050152
[2] https://news.ycombinator.com/item?id=44056530 | |
| ▲ | hsn915 a day ago | parent | prev | next [-] | | Is this something that the models from 4 months ago were not able to do? | | |
| ▲ | vessenes 9 hours ago | parent [-] | | For a fair definition of able, yes. Those models had no ability to engage in a search of emails. What’s special about it is that it required no handholding; that is new. | | |
| ▲ | camdenreslink 8 hours ago | parent [-] | | Is this because the models improved, or because the tooling around the models improved (both visible and invisible to the end user)? My impression is that the base models have not improved dramatically in the last 6 months, and incremental improvement in those models is becoming extremely expensive. |
|
| |
| ▲ | morepedantic a day ago | parent | prev [-] | | The LLMs have reached a plateau. Successive generations will be marginally better. We're watching innovation move into the use and application of LLMs. | | |
| ▲ | the8472 9 hours ago | parent [-] | | Innovation and better application of a relatively fixed amount of intelligence got us from wood spears to the moon. So even if the plateau is real (which I doubt given the pace of new releases and things like AlphaEvolve) and we'd only expect small fundamental improvements some "better applications" could still mean a lot of untapped potential. |
|
|
|
| ▲ | strangescript a day ago | parent | prev | next [-] |
| I have used Claude Code a ton and I agree; I haven't noticed a single difference since updating. Its summaries are, I guess, a little cleaner, but it has not surprised me at all in ability. I find I am correcting it and re-prompting it more than I did with 3.7 on a TypeScript codebase. In fact, I was kind of shocked how badly it did in a situation where it was editing the wrong file, and it never thought to check that until I forced it to delete all the code and show that nothing changed in what we were looking at. |
| |
| ▲ | hsn915 a day ago | parent [-] | | I'd go so far as to say Sonnet 3.5 was better than 3.7 At least I personally liked it better. | | |
| ▲ | vessenes 9 hours ago | parent [-] | | I also liked it better, but the aider leaderboards are clear that 3.7 was better. I found it extremely over-eager as a coding agent, but my guess is that it needed different prompting than 3.6. |
|
|
|
| ▲ | jug 15 hours ago | parent | prev | next [-] |
| This is my feeling too, across the board. Nowadays, benchmark wins seem to come from tuning, which then causes losses in other areas. o3 and o4-mini also hallucinate more than o1 on SimpleQA and PersonQA. Synthetic data seems to cause higher hallucination rates, and reasoning models are at even higher risk, since a hallucination at any reasoning step can throw the model off track. LLMs in the generic-use sense have been done since earlier this year; OpenAI discovered this when they had to cancel GPT-5 and later released the "too costly for the gains" GPT-4.5 that will be sunset soon. I'm not sure the stock market has factored all this in yet. There needs to be a breakthrough to get us past this point. |
|
| ▲ | voiper1 a day ago | parent | prev | next [-] |
| The benchmarks in many ways seem very similar to Claude 3.7 for most cases. That's nowhere near enough reason to think we've hit a plateau; the pace has been super fast, so give it a few more months before calling that...! I think the opposite about the features: they aren't gimmicks at all, but indeed they aren't part of the core AI. Rather, they're important "tooling" adjacent to the AI that we need to actually leverage it. The LLM field in popular usage is still in its infancy. If the models don't improve (though I expect they will), we have a TON of room with these features and how we interact, feed them information, make tool calls, etc., to greatly improve usability and capability. |
|
| ▲ | fintechie 14 hours ago | parent | prev | next [-] |
| It's not that it isn't better; it's actually worse. Seems like the big guys are stuck in a race to overfit for benchmarks, and this is becoming very noticeable. |
|
| ▲ | sanex a day ago | parent | prev | next [-] |
| Well, to be fair, it's only a .3 difference. |
|
| ▲ | pantsforbirds 7 hours ago | parent | prev | next [-] |
| It seems MUCH better at tool usage. Just had an example where I asked Sonnet 4 to split a PR I had after we had to revert an upstream commit. I didn't want to lose the work I had done, and I knew it would be a pain to do it manually with git. The model did a fantastic job of iterating through the git commits and deciding what to put into each branch. It got everything right except for a single test that I was able to easily move to the correct branch myself. |
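The git mechanics behind splitting a branch like that come down to cherry-picking each commit onto whichever new branch it belongs to. A rough sketch of just that mechanical part (file names, branch names, and the routing rule are all invented; the interesting part in the anecdote is that the model decided the routing itself):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Build a throwaway repo with a base commit plus two commits that
# conceptually belong in separate PRs.
repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("checkout", "-q", "-b", "main", cwd=repo)
git("config", "user.email", "x@example.com", cwd=repo)
git("config", "user.name", "x", cwd=repo)
git("commit", "-q", "--allow-empty", "-m", "base", cwd=repo)
for fname in ("feature.py", "fix.py"):
    with open(os.path.join(repo, fname), "w") as f:
        f.write("pass\n")
    git("add", fname, cwd=repo)
    git("commit", "-q", "-m", f"add {fname}", cwd=repo)

# Oldest-first (hash, subject) pairs, skipping the base commit.
log = [line.split(" ", 1) for line in
       git("log", "--reverse", "--format=%H %s", cwd=repo).strip().splitlines()][1:]

# Route each commit to its own branch started from the base commit.
base = git("rev-list", "--max-parents=0", "HEAD", cwd=repo).strip()
for sha, subject in log:
    branch = "split-" + subject.split()[-1].removesuffix(".py")
    git("checkout", "-q", "-b", branch, base, cwd=repo)
    git("cherry-pick", sha, cwd=repo)
git("checkout", "-q", "main", cwd=repo)
print(sorted(git("branch", "--format=%(refname:short)", cwd=repo).split()))
# ['main', 'split-feature', 'split-fix']
```

Requires `git` on PATH. The hard part the model handled in the anecdote is replacing the toy routing rule with an actual judgment about which commits belong together.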
|
| ▲ | brookst a day ago | parent | prev | next [-] |
| How much have you used Claude 4? |
| |
| ▲ | hsn915 a day ago | parent [-] | | I asked it a few questions and it responded exactly like all the other models do. Some of the questions were difficult / very specific, and it failed in the same way all the other models failed. | | |
| ▲ | theptip 19 hours ago | parent [-] | | Great example of this general class of reasoning failure. “AI does badly on my test therefore it’s bad”. The correct question to ask is, of course, what is it good at? (For bonus points, think in terms of $/task rather than simply being dominant over humans.) | | |
| ▲ | atworkc 15 hours ago | parent [-] | | "AI does badly on my test, much like other AIs did before it, therefore I don't immediately see much improvement" is a fair conclusion. | | |
| ▲ | brookst 11 hours ago | parent | next [-] | | No, it’s really not. “I used an 8088 CPU to whisk egg whites, then an Intel core 9i-12000-vk4*, and they were equally mediocre meringues, therefore the latest Intel processor isn’t a significant improvement over one from 50 years ago” * Bear with me, no idea their current naming | | |
| ▲ | Kon-Peki 8 hours ago | parent [-] | | You’re holding them wrong. An 8088 package should be able to emulate a whisk about a million times better than an i9. |
| |
| ▲ | theptip 9 hours ago | parent | prev [-] | | “Human can’t fly, much like other humans. Therefore it’s bad” Spot the problem now? AI capabilities are highly jagged, they are clearly superhuman in many dimensions, and laughably bad compared to humans in others. |
|
|
|
|
|
| ▲ | illegally 13 hours ago | parent | prev | next [-] |
| Yes. They just need to put out a simple changelog for these model updates; there's no need to make a big announcement every time to make it look like it's a whole new thing. And the version numbers are even worse. |
|
| ▲ | flixing a day ago | parent | prev | next [-] |
| I think you are. |
|
| ▲ | go_elmo a day ago | parent | prev | next [-] |
| I feel like the model making a memory file to store context is more than a gimmick, no? |
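The memory-file idea is conceptually tiny: append durable notes to a file on disk and reload them at the start of each session, so context survives beyond the model's window. A minimal sketch; the file name and format here are invented, not Anthropic's actual implementation:

```python
import json
import os
import tempfile

# Hypothetical on-disk memory: a JSON list of free-text notes.
MEMORY_PATH = os.path.join(tempfile.mkdtemp(), "memory.json")

def recall() -> list[str]:
    """Load all remembered notes, or an empty list on first run."""
    if not os.path.exists(MEMORY_PATH):
        return []
    with open(MEMORY_PATH) as f:
        return json.load(f)

def remember(note: str) -> None:
    """Append one note and persist the whole list back to disk."""
    notes = recall()
    notes.append(note)
    with open(MEMORY_PATH, "w") as f:
        json.dump(notes, f)

remember("user prefers TypeScript")
remember("repo uses pnpm, not npm")
print(recall())  # ['user prefers TypeScript', 'repo uses pnpm, not npm']
```

What makes it more than a gimmick is less the mechanism than the model deciding, unprompted, what is worth writing down.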
|
| ▲ | make3 a day ago | parent | prev [-] |
| The increases are not as fast, but they're still there. The models are already exceptionally strong; I'm not sure that basic questions can capture the differences very well. |
| |
| ▲ | hsn915 a day ago | parent [-] | | Hence, "plateau" | | |
| ▲ | j_maffe a day ago | parent | next [-] | | "Plateau" in the sense that your tests are not capturing the improvements. If your usage doesn't exercise its new capabilities, then for you effectively nothing changed, yes. | |
| ▲ | rxtexit 13 hours ago | parent | prev | next [-] | | "I don't have anything to ask the model, so the model hasn't improved" Brilliant! I am pretty much ready to be done talking to human idiots on the internet. It is just so boring after talking to these models. | | | |
| ▲ | make3 21 hours ago | parent | prev [-] | | plateau means stopped | | |
|
|