I’m curious (and please forgive my ignorance if it’s obvious), are open weight models practically feasible?

I mean from a financial and sustainability standpoint, assuming they’re equally powerful as their proprietary counterparts.

I guess I’m trying to understand the economics of it.

▲

SimianSci 9 hours ago | parent | next [-]

There is an understandable gap between the capabilities of closed models and those of open models. The current difference shows up mainly in the cost of the hardware needed to run a truly comparable model. A single higher-end graphics card in an average gaming computer can run small to medium models that compare well with lab-built models in the same size range. But the heavyweight models are still out of reach for all but the most well-funded individuals.

However, I would highly suggest more people experiment with these smaller models. They are incredibly capable in ways that many people don't realize.

The perceived capabilities of the larger models are also much less the result of more parameters or training cycles than of running them through well-made harnesses, something the open-source community is rapidly approaching with near-peer solutions of its own.

In short, much of the gap between open-weight models and the larger proprietary models is more an issue of perception than of capability. There is a fundamental gap economically, but not so much in capability. The open-source community is rapidly closing the gap on these larger labs, especially thanks to the research being given away freely by well-funded Chinese labs.

▲

anigbrowl 9 hours ago | parent | prev | next [-]

Sort of. A full trillion-parameter model needs about $300k of server hardware to run on, plus a lot of electricity, making it feasible only for very wealthy individuals but quite practical for businesses and institutions above a certain size... although they in turn would typically gatekeep access.

You can drastically reduce the requirements by running models at a lower bit width (quantization), which reduces accuracy somewhat but not by much; think of the difference between an MP3 and uncompressed audio. With this and other tricks, you can get high-end models down to a size where they can run on a high-spec desktop workstation affordable by an individual or small business.
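
To make the quantization point concrete, here is a minimal sketch of how weight memory scales with bit width. It counts the weights only and assumes every weight is stored at the nominal width; real quantization formats add some per-block overhead, and the function name is just for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB.

    Assumption: every weight is stored at exactly bits_per_weight bits.
    KV cache, activations and runtime overhead are not included.
    """
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# A trillion-parameter model at a few common bit widths.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gib(1e12, bits):8.1f} GiB")
```

Halving the bit width halves the weight footprint, which is why a 4-bit quant needs roughly a quarter of the memory of the 16-bit original.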

Obviously I'm heavily oversimplifying here. I think a useful parallel is to consider situations from the past where you would once have required corporate budgets equivalent to the price of a house to run a large database, but over time it became accessible to anyone with the requisite expertise and relatively affordable hardware.

▲

sosodev 9 hours ago | parent [-]

You can run a trillion parameter model with decent quality for far less than $300k. A cluster of 4 AMD Ryzen AI Max+ 395 boards with 128GB unified memory each can be had for around $15k. That would run the 4-bit quant of a trillion param model well enough for personal use. At full use the cluster would only be consuming around 400-500W of power too. That's about the same as one high end graphics card.

That's still a lot of money, but most people don't really need a trillion parameter model. If privacy is more valuable than the frontier capabilities then they could almost certainly get by with much less.

▲

anigbrowl 2 hours ago | parent | next [-]

I literally wrote about running quantized models and how much more affordable it could be in the very next sentence. Please don't reply if you can't be bothered to read the entire comment, it's not that long.

▲

sosodev 2 hours ago | parent [-]

I read the comment, thanks. I just disagree with your cost estimate. Even a small business that needs high throughput could probably do it for far less than $300k if it isn’t just blindly buying the first big Nvidia setup it can.

▲

nijave 6 hours ago | parent | prev [-]

Which model? I see a suspiciously similar post on amd.com running a 2-bit Kimi quant on a four-node cluster over 5Gbps Ethernet.

Assuming the math works here (though I think there are some caveats depending on the model architecture), 1T params at 4 bits is about 465 GiB just for the weights, so you wouldn't be able to fit the KV cache.
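
A rough sketch of that fit check, assuming four nodes with 128 GiB each. The `usable_fraction` parameter is a hypothetical knob for the share of memory the inference process can actually claim; the 0.9 used below is an illustration, not a measured value.

```python
GIB = 2**30


def cluster_headroom_gib(
    num_params: float,
    bits_per_weight: float,
    nodes: int,
    mem_per_node_gib: float,
    usable_fraction: float = 1.0,
) -> float:
    """Memory left over after loading the weights, in GiB.

    A negative result means the weights alone do not fit.
    Assumption: weights are spread evenly across nodes with no duplication.
    """
    weights_gib = num_params * bits_per_weight / 8 / GIB
    usable_gib = nodes * mem_per_node_gib * usable_fraction
    return usable_gib - weights_gib


# 1T params on 4 x 128 GiB, ideal case vs. 90% usable, at 4-bit and 2-bit.
for bits in (4, 2):
    for frac in (1.0, 0.9):
        left = cluster_headroom_gib(1e12, bits, 4, 128, frac)
        print(f"{bits}-bit, {frac:.0%} usable: {left:7.1f} GiB headroom")
```

Under these assumptions the 4-bit weights leave a few tens of GiB at best and go negative once some memory is reserved, while a 2-bit quant leaves room for the cache.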

It's showing about 8-9 tk/sec, which seems quite slow for something like a web search with result aggregation, although maybe bearable for smaller-context stuff.
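
For a feel of what 8-9 tk/sec means in wall-clock time, a small sketch follows. The output lengths are hypothetical, and it only covers generation; time spent reading the prompt and the fetched pages is ignored.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical response lengths at the midpoint of the quoted 8-9 tk/sec.
for tokens in (200, 1000, 4000):
    secs = generation_seconds(tokens, 8.5)
    print(f"{tokens:>5} tokens: {secs / 60:5.1f} min")
```

A short chat reply stays tolerable, but a multi-thousand-token aggregation step runs into minutes.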

The thing I've been running into with z.ai hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation which is more token intensive so low tk/sec hurts even more than a "smarter" model

It seems (somewhat unsurprisingly) open weight models have older knowledge cutoffs.

▲

sosodev 5 hours ago | parent [-]

I don’t have any particular model in mind, sorry. My data is just rough estimates based on my experience with a single node setup. You might need to opt for a 2 or 3 bit model to get the full context window. The KV cache memory consumption, as well as overall performance, will depend heavily on the model’s architecture. Performance will also depend a lot on the inference server chosen and its configuration. I suspect a sub-agent running a much smaller model would be the ideal way to get the latest knowledge via web search and summarization.
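
As an illustration of how architecture drives KV cache size, here is the usual estimate for a standard attention layout with grouped-query heads. The layer and head counts below are hypothetical, not taken from any particular model, and architectures that compress the cache will come out smaller.

```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: int = 2,
    batch: int = 1,
) -> float:
    """KV cache size in GiB for plain multi-head or grouped-query attention.

    The leading 2 accounts for storing both keys and values.
    """
    total_bytes = (
        2 * layers * kv_heads * head_dim
        * context_tokens * bytes_per_element * batch
    )
    return total_bytes / 2**30


# Hypothetical config: 64 layers, 8 KV heads of dim 128, 16-bit cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(64, 8, 128, ctx):5.1f} GiB")
```

The cache grows linearly with context length, so a full long-context window can cost as much memory as dropping the weights by a bit or two saves.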

I’m not trying to say that this would be a great experience or really compete with just buying a subscription to the top models. Rather I just wanted to point out that $300k is an absurd estimate for a trillion param model meant for personal use.

▲

nijave 5 hours ago | parent [-]

I imagine a smaller single node model would have a significantly better experience at significantly lower cost. When I was poking around with infra estimates it seemed the main issue/cost was once you crossed from single-node to multi-node. You need _a lot_ of bandwidth if the weights are sharded. Like Tbps of bandwidth. The closest reasonable thing I've heard of for local multi-node is exo on macOS using a Thunderbolt interconnect.

▲

sosodev 2 hours ago | parent [-]

I think it really just depends on your goals. A low tokens-per-second rate is fine for some people if the setup costs a fraction of a single node that can run a trillion param model. If you’re actually running a small business and want multiple users to get a good experience in parallel, then yeah, I think you need a single node. At that point you can afford it, I suppose. I don’t know what the scaling for multiple Strix Halo boards looks like in practice. From what I understand, each server has to process the model in serial, meaning server A has 1/4 of the weights and sends server B the results to process, and so on. So you don’t get compute scaling, just memory scaling.
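
A toy model of that serial hand-off, to show why adding nodes helps memory but not single-stream speed. All timings are hypothetical, and the model assumes one request at a time with the layers split evenly across nodes.

```python
from dataclasses import dataclass


@dataclass
class PipelineEstimate:
    weights_per_node_gib: float
    ms_per_token: float
    tokens_per_second: float


def estimate_pipeline(
    total_weights_gib: float,
    nodes: int,
    full_model_ms_per_token: float,
    hop_ms: float,
) -> PipelineEstimate:
    """Single-stream estimate when nodes process their layers one after another.

    Assumptions: full_model_ms_per_token is the compute time for all layers
    on one node's hardware; each hand-off between nodes costs hop_ms.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    per_stage_ms = full_model_ms_per_token / nodes
    # Stages run back to back, so compute time adds up to the full amount
    # again, and every extra node adds one network hop on top.
    ms_per_token = per_stage_ms * nodes + hop_ms * (nodes - 1)
    return PipelineEstimate(
        weights_per_node_gib=total_weights_gib / nodes,
        ms_per_token=ms_per_token,
        tokens_per_second=1000 / ms_per_token,
    )


for n in (1, 2, 4):
    est = estimate_pipeline(465.7, n, full_model_ms_per_token=100.0, hop_ms=5.0)
    print(n, round(est.weights_per_node_gib, 1), round(est.tokens_per_second, 2))
```

In this sketch the per-node memory requirement drops with each added node while tokens per second stays flat or gets slightly worse, which matches the "memory scaling, not compute scaling" description.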

▲

roadside_picnic 9 hours ago | parent | prev | next [-]

See my comment to parent. I've been using local LLMs for practical, personal tasks for a few months now, very successfully.

You can run fantastic local models if you have either:

- M-series Apple device with ideally >= 24GB of VRAM

- RTX [345]090 GPU

I'm fortunate enough to have both and use an M-series laptop as basically a persistent server (I don't use it much and when traveling typically just use my work laptop). My desktop doesn't act as a persistent server but I fire up llama.cpp on it all the time for quick chat sessions.

If you have one of the above devices and can dedicate it as a server, there are additional layers of tooling you can use that dramatically improve the experience. In particular, Open WebUI allows you to add tons of useful tools (image gen, web search, code eval, etc.), and agent harnesses like Hermes can make the current gen small models very capable. I have an agent in chat on my phone that basically handles all the sys-admin for the server it runs on.

▲

hn_acc1 8 hours ago | parent [-]

What about RTX 3080? Too little VRAM?

▲

roadside_picnic 8 hours ago | parent [-]

In addition to models getting better, quantization methods have also gotten much better. If you already have an RTX 3080, it's absolutely worth the time to just mess around and see how it does; experiment with different quants that fit in your VRAM. If you're purchasing, I would recommend coughing up the extra cash for the 3090. If you are experimenting, it's worth mentioning that the harness/tooling is very important to getting a solid experience. Hermes agent is great for running helpful agents, and Open WebUI can really make the experience feel on par with paid chat interfaces. A reasonable halfway step is to pay for an open model through the provider or OpenRouter. You'll get many of the benefits (especially around pricing) without needing to shell out for hardware before deciding if you like the way these models work.
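
When picking quants to try, a quick check like the following can narrow the list. It is only a sketch: the 14B model size, the 10 GiB and 24 GiB card sizes, and the 2 GiB reserve for cache and activations are assumptions for illustration.

```python
def fitting_bit_widths(
    num_params: float,
    vram_gib: float,
    reserve_gib: float = 2.0,
    candidates: tuple[int, ...] = (16, 8, 6, 5, 4, 3, 2),
) -> list[int]:
    """Return the candidate bit widths whose weights fit in VRAM.

    reserve_gib is held back for KV cache, activations and the runtime.
    Assumption: weights take exactly bits/8 bytes each, with no format overhead.
    """
    budget_bytes = (vram_gib - reserve_gib) * 2**30
    return [b for b in candidates if num_params * b / 8 <= budget_bytes]


# Hypothetical 14B-parameter model on a 10 GiB card and a 24 GiB card.
print(fitting_bit_widths(14e9, 10))
print(fitting_bit_widths(14e9, 24))
```

Anything that does not make the list can still run with layers offloaded to system RAM, just more slowly.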

▲

KronisLV 8 hours ago | parent | prev | next [-]

> I mean from a financial and sustainability standpoint, assuming they’re equally powerful as their proprietary counterparts.

Presently they trail SOTA by about 6-12 months rather than being on par (averaged across everything they do).

DeepSeek V4 Pro with Max reasoning is very affordable even if you pay per-token. This month I pushed about 486 million tokens through it (I will admit that >95% was cache hits, which is pretty typical for agentic development) and it cost me about 8 USD in total. Meanwhile, with Opus or even Sonnet, if I had to pay API prices I would be a much sadder camper. That model does a lot of stupid things though, so it's not ideal.
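
The numbers above work out to well under two cents per million tokens. A small sketch of that arithmetic, plus how the cache-hit ratio drives a blended bill; the per-million prices in the second part are hypothetical placeholders, not any provider's actual rates, and input and output tokens are lumped together.

```python
def effective_price_per_million(total_cost_usd: float, total_tokens: float) -> float:
    """Average price actually paid per million tokens."""
    return total_cost_usd / (total_tokens / 1e6)


def blended_cost_usd(
    total_tokens: float,
    cache_hit_ratio: float,
    hit_price_per_m: float,
    miss_price_per_m: float,
) -> float:
    """Total cost when cached and uncached tokens are billed at different rates."""
    hit_tokens = total_tokens * cache_hit_ratio
    miss_tokens = total_tokens - hit_tokens
    return (hit_tokens * hit_price_per_m + miss_tokens * miss_price_per_m) / 1e6


# From the comment: about 8 USD for about 486 million tokens.
print(f"{effective_price_per_million(8.0, 486e6):.4f} USD per million tokens")

# Hypothetical rates, showing how sensitive the bill is to the hit ratio.
for ratio in (0.50, 0.95):
    print(ratio, round(blended_cost_usd(486e6, ratio, 0.01, 0.10), 2))
```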

Meanwhile, GLM-5.2, which came out recently, is also quite capable and is near Opus in many tasks, all while their coding plan is more cost effective than Anthropic's: https://z.ai/subscribe

I will still stick with Anthropic but consider downgrading from Max 5x to Pro which will change the monthly expenses from around 108 EUR down to <20 EUR (they have a discount too if you pay for a year up front), and probably get the yearly GLM Pro plan which should decrease my yearly expenses from around 1300 EUR total to about 750 total EUR while still giving me a fairly decent setup.

For the consumer, that is doable and practical.

For the people actually running these models, who knows - at least DeepSeek and others are trying to make the models more efficient so the numbers are more feasible.

I have also run Qwen3.6 35B A3B on prem and it kinda sucks. Way better than models that size a year ago, but it still lags behind Sonnet and also DeepSeek V4 Flash due to the size limits. Plus, to even run it myself I'd need a pretty beefy setup, most likely a pair of Intel Arc Pro B70s with 32 GB of VRAM each that I could still run off of my PSU, but the actual model output would be kinda bullshit and I'd have to spend an unpleasant amount of time fixing it.

▲

hatthew 9 hours ago | parent | prev | next [-]

I'm also curious, specifically about the cost of training vs inference, and comparing that to other industries that can have high R&D costs. My instinct says that open weights aren't feasible because of the obvious issue where there is no incentive to develop your own model rather than just taking someone else's model. However, I could see a scenario where a hardware company designs a model that is open weights but optimized strongly for their own proprietary hardware, cutting their costs of inference low enough to be competitive with a hypothetical other company that doesn't have any R&D expenditure.

▲

sosodev 9 hours ago | parent | prev | next [-]

It depends entirely on what you want to do and think is feasible. Small models can almost certainly run on the computer that you already have. They can do good tool calling.

▲

epolanski 9 hours ago | parent | prev | next [-]

Yes, they are. You can use Qwen, DS4 Pro and GLM 5.2 if you have the hardware to do so.

They are not SOTA in various ways but they have better economics.

▲

waffletower 9 hours ago | parent | prev | next [-]

If attractive, cloud providers could develop open models with their own investment, and sell hosted access as a business model. While Google checks these boxes, I haven't seen much marketing focus from Google on their open models (Gemma) coupled with hosting. Groq could conceivably train its own models, but Groq's business model hosts open models trained by other organizations (GPT OSS, Qwen 3 and Llama 4 are currently the prominently advertised models on their site... which seems out of date to me).

▲

andrewstuart2 9 hours ago | parent | prev [-]

I hope/wonder if it will go the way computers did. We may learn to more effectively build RAM or parallel compute, and use it more effectively, in the coming decade in such a way that we can democratize more and more like we did with processors to the point that they're ubiquitous.