I feel like the gap is closing to be able to run good enough models locally even for coding and I would assume it could make some companies a bit nervous. Am I wrong about that?

▲

UncleOxidant 6 hours ago | parent | next [-]

If we didn't have a RAM/GPU shortage right now they would be more nervous than they are. But as it is, very few people are going to be able to afford a rig that can run this model effectively. That's probably not going to change for several more years yet. I think if the Z.ai folks decide to come out with a flash version of GLM-5.2 specialized for coding that came in at about 80B params, then the US frontier labs would probably be more worried. Overall, the Chinese AI companies have been showing the way to do the same amount with less (sometimes much less) and as that trend continues it's going to make the frontier labs worried - but even the Chinese AI companies are going to want to protect their moat by not releasing capable models that are significantly smaller than their current flagship models. Alibaba Qwen seems to be there now - it's gotten mighty quiet from them lately - their latest 395B model is just too large for most folks to run at home and they don't seem to be making any noises about releasing smaller ones this time around.

▲

gpm 5 hours ago | parent | next [-]

The ram/gpu shortage won't last forever though. Moreover we can be pretty confident that long-term the prices will obey Wright's law and come down significantly (from the pre-shortage prices) as we learn to produce them more efficiently.

LLM companies are valued as if they're going to have some enduring monopoly that they can extract money from... GLM-5.2 and similar models make that valuation very very questionable.
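
As a rough illustration of the Wright's law point: the law says unit cost falls by a fixed fraction every time cumulative production doubles. The sketch below is a minimal version of that curve. The 20% learning rate and the $100 starting cost are made-up placeholders, not figures for RAM or GPUs, and the function name is hypothetical.

```python
import math


def wright_unit_cost(first_unit_cost, cumulative_units, learning_rate):
    """Unit cost after `cumulative_units` have been produced.

    Wright's law: every doubling of cumulative production cuts unit cost
    by `learning_rate` (e.g. 0.2 means each doubling costs 20% less).
    """
    if cumulative_units < 1:
        raise ValueError("cumulative_units must be at least 1")
    if not 0.0 <= learning_rate < 1.0:
        raise ValueError("learning_rate must be in [0, 1)")
    # Exponent b chosen so that doubling output multiplies cost by (1 - learning_rate).
    b = -math.log2(1.0 - learning_rate)
    return first_unit_cost * cumulative_units ** (-b)


if __name__ == "__main__":
    # Placeholder inputs: $100 for the first unit, 20% learning rate (assumptions).
    for doublings in range(0, 6):
        units = 2 ** doublings
        cost = wright_unit_cost(100.0, units, 0.2)
        print(f"{units:>3} cumulative units -> ${cost:.2f} per unit")
```

The curve only describes production learning. It says nothing about when a shortage ends, which is the part the replies below disagree on.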

▲

UncleOxidant 5 hours ago | parent | next [-]

> The ram/gpu shortage won't last forever though.

No disagreement there, but it could easily last another 3 to 5 years which is a long time in tech terms.

▲

DougN7 2 hours ago | parent [-]

Long enough for them to IPO and all the execs to retire. I doubt they care beyond the IPO.

▲

mannanj 5 hours ago | parent | prev [-]

> The ram/gpu shortage won't last forever though

Don't underestimate the market's ability to remain irrational

▲

colinsane 3 hours ago | parent [-]

the companies which have the power to alleviate these shortages are the same companies who are profiting most from the shortage. scarcity is an asset, it's not irrational that a concentrated market will produce more of that asset.

▲

selectodude 3 hours ago | parent [-]

The solution for high prices is high prices. If making RAM and SSDs is now cause for a 10 figure valuation, after enough time somebody will dive in.

▲

elorant 5 hours ago | parent | prev | next [-]

Very few people, but quite a lot of companies, especially after per-token pricing took effect and companies saw their invoices skyrocketing. You pay an upfront cost once and you’re done.

▲

dannyw 2 hours ago | parent | prev | next [-]

When a large open weight model is released, a lab, startup, or a rich hobbyist can easily do logit-level distillation and create a XXb param model or whatever, and in theory you should get a really good distill.

▲

verdverm 5 hours ago | parent | prev [-]

I suspect the time horizon is shorter because of software advances. We are getting more capability out of smaller models

Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

▲

UncleOxidant 3 hours ago | parent [-]

> Alibaba released Qwen 3.6 "tiny" models not that long ago, they punch way above their weight(s)

True, Qwen3.6-27B is amazing for its size. However, it seems likely that we're not going to see any more of these smaller models from Alibaba/Qwen since several key players exited that organization a few months back.

▲

Infernal 3 hours ago | parent | next [-]

Do we know where those key players went?

▲

verdverm 2 hours ago | parent | prev [-]

Good to know. I think the trend is clear based on the models coming out of China, and we'll see more capabilities in smaller, more efficient models.

▲

simplyluke 4 hours ago | parent | prev | next [-]

You don't even need to run them locally for them to be a threat. Plenty of companies are looking at paying third party companies to host these models and they come in at fractions of the price of the frontier labs.

▲

cogman10 6 hours ago | parent | prev | next [-]

I don't think so. I could easily see a company deciding to host and run these models for their own development. If you have a dev team of about 10 people, a one time $50k investment in an LLM server has to be pretty tempting. Unlimited tokens, decent performance, upgrade options, and potential product integrations.

For companies wanting LLMs in their products in general, I have to think going the local LLM route is even more tempting. Somewhat dumb models are more than good enough for a lot of the things people are integrating LLMs into their products to do.

▲

twelvechairs 6 hours ago | parent | next [-]

Surely for most the desire is just an LLM provider that doesn't store or sell their queries (including by national actors). As long as that is allowed to happen surely it's the answer for the vast majority.

▲

eventualcomp 6 hours ago | parent | prev [-]

Where is $50k coming from again?

▲

stingraycharles 6 hours ago | parent | next [-]

That’s less than the monthly salary of 10 software engineers, and assuming they pay API prices, probably earns itself back in about a year.

Having said that, I don’t think it’s all that tempting for companies at all, considering this whole market is developing rapidly and it’s nearly impossible to predict where we’ll be at in a year or two.
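
The "earns itself back in about a year" claim can be sanity-checked with a simple break-even calculation. Below is a minimal sketch: the $50k server and the 10-person team come from the thread, while the per-developer API spend and the monthly running cost are hypothetical placeholders you would replace with your own invoices. The function name is also just for illustration.

```python
def breakeven_months(hardware_cost, monthly_api_spend, monthly_running_cost=0.0):
    """Months until an upfront hardware purchase pays for itself.

    Returns None when running the hardware costs at least as much per
    month as the API bill it replaces, since it then never breaks even.
    """
    monthly_savings = monthly_api_spend - monthly_running_cost
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    developers = 10                    # team size mentioned in the thread
    hardware_cost = 50_000             # one-time server cost mentioned in the thread
    api_spend_per_dev = 450            # ASSUMPTION: monthly API spend per developer, USD
    running_cost = 300                 # ASSUMPTION: monthly power and upkeep, USD

    months = breakeven_months(
        hardware_cost,
        developers * api_spend_per_dev,
        running_cost,
    )
    if months is None:
        print("Never breaks even at these numbers")
    else:
        print(f"Break-even after about {months:.1f} months")
```

The result is very sensitive to the assumed API spend, and the sketch ignores depreciation and the chance that the hardware cannot run a future model, which is the uncertainty raised above.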

▲

cogman10 6 hours ago | parent [-]

The hardware requirements aren't evolving and the local models have only been improving.

It's not like you'd lose capabilities, if anything this solution just gets better with time.

▲

chatmasta 5 hours ago | parent [-]

If the newer models require more/better hardware then you’ll lose capabilities.

I think you’re better off renting GPU instances and running all the software on those. It’ll be cheaper than Anthropic and OpenRouter but slightly more expensive than electricity and depreciation of hardware.

▲

cogman10 4 hours ago | parent [-]

The newer models don't require more/better hardware. There's a small army of local llm enthusiasts who are running LLMs using 3090s and H100s because they have lots of memory. Them being old isn't really that big of an issue as the compute power needed is relatively low all things considered.

The number of parameters needed for these open weight models has mostly stabilized so the actual memory requirements aren't likely to change all that much.

▲

dannyw 2 hours ago | parent [-]

Correct. The main bottleneck with LLM inference is, and has always been, memory bandwidth. TPS ≈ your memory bandwidth in GB/s / active weights in GB. That’s it for decode. That’s all.
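
A rough sketch of the bandwidth-bound estimate for decode: each generated token has to read every active weight once, so tokens per second is at most memory bandwidth divided by the bytes of active weights. The function name and the example inputs are illustrative assumptions (a 27B dense model at 4-bit quantization on a card with 936 GB/s of bandwidth, the published figure for an RTX 3090). The result is an upper bound that ignores KV-cache reads, compute, and any overhead.

```python
def decode_tokens_per_second(memory_bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Upper bound on decode speed when inference is memory-bandwidth bound.

    memory_bandwidth_gb_s:   memory bandwidth in GB/s
    active_params_billions:  parameters touched per token, in billions
                             (for a mixture-of-experts model, only the active experts)
    bytes_per_param:         2.0 for FP16/BF16, 1.0 for 8-bit, 0.5 for 4-bit
    """
    if memory_bandwidth_gb_s <= 0 or active_params_billions <= 0 or bytes_per_param <= 0:
        raise ValueError("all inputs must be positive")
    # Billions of parameters times bytes per parameter gives gigabytes read per token.
    active_weights_gb = active_params_billions * bytes_per_param
    return memory_bandwidth_gb_s / active_weights_gb


if __name__ == "__main__":
    # ASSUMPTION: 27B dense model, 4-bit weights, 936 GB/s memory bandwidth.
    tps = decode_tokens_per_second(936.0, 27.0, 0.5)
    print(f"Bandwidth-bound ceiling: about {tps:.0f} tokens/s")
```

This is why older cards with plenty of fast memory stay useful for inference: the estimate depends on bandwidth and model size, not on how new the compute units are.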

▲

cogman10 6 hours ago | parent | prev [-]

As in who pays for it or how did I arrive at that number?

For who pays for it, obviously the employer would.

For "how did I arrive at this number": a ballpark estimate from what I know about part costs. Most of that money would go towards AI cards: about $5k for the mb, cpu, power supply, etc., and $45k for as much ram and as big/expensive Nvidia cards as you can get your hands on. The B300 has 288GB of VRAM in it. Probably what you'd be after.

▲

fny 6 hours ago | parent | prev | next [-]

The RAM requirements are still pretty painful.

▲

yieldcrv 6 hours ago | parent [-]

equilibrium in one or two more years on the consumer/prosumer side

think Apple M6 or M7 with a currently unforeseen denser memory style, 256gb RAM

a couple inference or cache improvements on the algorithmic side, using less ram for context windows and doubling token speed again

denser open source models, packing more experts for smaller active layers

it'll still be expensive but like $8,000 - $13,000 instead of $450,000 worth of B200s

▲

stingraycharles 6 hours ago | parent [-]

Fairly certain that model sizes and computational requirements will grow as the price for LLM compute drops.

▲

3stacks 5 hours ago | parent | next [-]

Maybe there's a conversation to be had about how much is enough... Unless something beyond my imagination happened, I would be happy enough with Opus 4.5 levels of productivity

▲

yieldcrv 5 hours ago | parent | prev [-]

have you seen the open source LLM space? people fulfill all niches and there are active communities at every range of RAM, and all are looking for the most capable in their respective range. a lot of innovation occurring

▲

fsuts 35 minutes ago | parent | prev | next [-]

Why do you think they are rushing to IPO?!

▲

scosman 3 hours ago | parent | prev | next [-]

It's not economic to run them locally. It's amazing for privacy, and a fun hobby. But you're either looking at super slow CPU builds with $10k in RAM, $90k worth of GPUs, or a really quantized model that doesn't compare in quality.

I might build one for fun, but it's not going to change the economics alone. Still exciting it's possible.

▲

CamouflagedKiwi 6 hours ago | parent | prev | next [-]

The hardware requirements to run this locally are still very high. Seems far enough off mainstream for those companies not to be too worried yet.

▲

notatoad 5 hours ago | parent | prev | next [-]

locally on what hardware? something like the new dgx spark, ryzen halo, or mac studio will cost you ~ $4k plus whatever you pay for power. at the rate AI is currently progressing, i think you'd be optimistic to consider that as having a 2 year depreciation.

for $4k, you can get 20 months of claude max 200. i'd take claude over the hardware.

anthropic will have something to worry about when you can run a local model on your macbook that can code. but i think we're quite a ways off from that.

▲

chatmasta 5 hours ago | parent | next [-]

Just a hunch, but I think the most cost effective “local” deployment method right now is renting GPU clusters by the hour and running all the inference software on them yourself. This will be cheaper than capital expenditure on hardware that will depreciate and become last-gen, and cheaper than OpenRouter pay per token.

▲

tomr75 5 hours ago | parent | prev [-]

people who can't afford Claude max 200 are using qwen 3.6 27b for local coding assistance already

▲

stymaar 4 hours ago | parent | prev [-]

Honestly, Qwen3.6 is already what you need for the large majority of tasks.

(I only ask Opus every 5 to 10 requests, when my local Qwen fails or when I encounter a situation that is too world-knowledge specific to be worth asking, but that way you can live easily with Claude's cheapest plan without ever hitting a usage limit).