You can run a trillion-parameter model with decent quality for far less than $300k. A cluster of four AMD Ryzen AI Max+ 395 boards with 128GB of unified memory each can be had for around $15k. That would run a 4-bit quant of a trillion-param model well enough for personal use. At full load the cluster would only draw around 400-500W, about the same as one high-end graphics card.

That's still a lot of money, but most people don't really need a trillion-parameter model. If privacy matters more to them than frontier capabilities, they could almost certainly get by with much less.

anigbrowl 2 hours ago | parent | next [-]

I literally wrote about running quantized models and how much more affordable it could be in the very next sentence. Please don't reply if you can't be bothered to read the entire comment, it's not that long.

sosodev 2 hours ago | parent [-]

I read the comment, thanks. I just disagree with your cost estimate. Even a small business that needs high throughput could probably do it for far less than $300k if they aren’t just blindly buying the first big Nvidia setup they can.

nijave 6 hours ago | parent | prev [-]

Which model? I see a suspiciously similar post on amd.com running a 2-bit Kimi quant on a four-node cluster over 5Gbps Ethernet.

Assuming the math works here (though I think there are some caveats depending on the model architecture), 1T params at 4 bits is about 465GiB just for the weights, so you wouldn't be able to fit the KV cache.
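For anyone who wants to check the 465GiB figure, here is a rough sketch of the arithmetic. It assumes exactly 4 bits per weight (real quant formats carry extra scale metadata, so actual files are somewhat larger) and, optimistically, that all 128GiB on each of the four nodes is usable for the model. The function name is just for illustration.

```python
GIB = 2**30  # bytes per GiB


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Raw weight size in GiB, ignoring quantization metadata (assumption)."""
    return n_params * bits_per_weight / 8 / GIB


# Assumption: 4 nodes x 128 GiB, all of it available to the model.
cluster_total_gib = 4 * 128

four_bit = weights_gib(1e12, 4)
two_bit = weights_gib(1e12, 2)

print(f"4-bit weights: {four_bit:.1f} GiB, headroom: {cluster_total_gib - four_bit:.1f} GiB")
print(f"2-bit weights: {two_bit:.1f} GiB, headroom: {cluster_total_gib - two_bit:.1f} GiB")
```

Even under those generous assumptions, the 4-bit case leaves under 50GiB across the whole cluster for the OS, activations, and KV cache, which is why a 2-bit or 3-bit quant looks more realistic.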

It's showing about 8-9 tk/sec, which seems quite slow for something like a web search with result aggregation, although maybe bearable for smaller-context stuff.

The thing I've been running into with z.ai hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation, which is more token intensive, so low tk/sec hurts even more than it would with a "smarter" model.

It seems (somewhat unsurprisingly) open weight models have older knowledge cutoffs.

sosodev 5 hours ago | parent [-]

I don’t have any particular model in mind, sorry. My numbers are just rough estimates based on my experience with a single-node setup. You might need to opt for a 2- or 3-bit model to get the full context window. KV cache memory consumption, as well as overall performance, will depend heavily on the model’s architecture. Performance will also depend a lot on the inference server chosen and its configuration. I suspect a sub-agent running a much smaller model would be the ideal way to get the latest knowledge via web search and summarization.
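To make the architecture dependence concrete, here is a minimal sketch of the usual KV cache estimate for a model with standard multi-head or grouped-query attention. The layer count, KV head count, and head dimension below are made-up illustrative values, not any real model's config, and architectures that compress the cache (MLA-style attention, for example) will not follow this formula.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an fp16/bf16 cache.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 2**30


# Hypothetical config: 60 layers, 8 KV heads, head_dim 128, 128K-token context.
print(kv_cache_gib(layers=60, kv_heads=8, head_dim=128, context_tokens=131072))
```

With those hypothetical numbers a single full-length context costs 30GiB, so the cache can easily decide whether a given quant fits.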

I’m not trying to say that this would be a great experience or really compete with just buying a subscription to the top models. Rather I just wanted to point out that $300k is an absurd estimate for a trillion param model meant for personal use.

nijave 5 hours ago | parent [-]

I imagine a smaller single-node model would give a significantly better experience at significantly lower cost. When I was poking around with infra estimates, it seemed the main issue/cost came once you crossed from single-node to multi-node. You need _a lot_ of bandwidth if the weights are sharded. Like Tbps of bandwidth. The closest reasonable thing I've heard of for local multi-node is exo on macOS using a Thunderbolt interconnect.

sosodev 2 hours ago | parent [-]

I think it really just depends on your goals. Low tokens per second is fine for some people if the cluster costs a fraction of a single-node setup that can run a trillion-param model. If you’re actually running a small business and want multiple users to get a good experience in parallel, then yeah, I think you need a single node. At that point you can afford it, I suppose. I don’t know what the scaling for multiple Strix Halo boards looks like in practice. From what I understand, the servers have to process the model in series: server A holds 1/4 of the weights and sends its results to server B to process, and so on. So you don’t get compute scaling, just memory scaling.