sosodev 9 hours ago
You can run a trillion-parameter model with decent quality for far less than $300k. A cluster of four AMD Ryzen AI Max+ 395 boards with 128GB of unified memory each can be had for around $15k. That would run a 4-bit quant of a trillion-param model well enough for personal use. At full load the cluster would only draw around 400-500W too, about the same as one high-end graphics card. That's still a lot of money, but most people don't really need a trillion-parameter model. If privacy is more valuable than frontier capabilities, they could almost certainly get by with much less.
anigbrowl 2 hours ago
I literally wrote about running quantized models and how much more affordable it could be in the very next sentence. Please don't reply if you can't be bothered to read the entire comment; it's not that long.
nijave 6 hours ago
Which model? I see a suspiciously similar post on amd.com running a 2-bit Kimi quant on a four-node cluster over 5Gbps Ethernet.

Assuming the math works here (I think there are some caveats depending on the model architecture), 1T params at 4-bit is 465GiB just for the weights, so you wouldn't be able to fit the KV cache. The post shows about 8-9 tk/sec, which seems quite slow for something like a web search with result aggregation, although maybe bearable for smaller-context stuff.

The thing I've been running into with z.ai-hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation, which is more token-intensive, so low tk/sec hurts even more than it would with a "smarter" model.

It seems (somewhat unsurprisingly) that open-weight models have older knowledge cutoffs.
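
For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch of the memory and speed math above. It assumes each 128GB board exposes a full 128 GiB to the model (in practice the OS and driver take a share), ignores quantization metadata such as scales, and uses a made-up 2,000-token response length purely for illustration. The function names are hypothetical, not from any library.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Raw weight size in GiB, ignoring quantization metadata (scales etc.)."""
    return n_params * bits_per_weight / 8 / GIB


def headroom_gib(n_params: float, bits_per_weight: float,
                 nodes: int, mem_per_node_gib: float) -> float:
    """Memory left for KV cache, activations and the OS after loading weights.

    Assumption: all of mem_per_node_gib is usable, which is optimistic.
    """
    return nodes * mem_per_node_gib - weight_footprint_gib(n_params, bits_per_weight)


def generation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_sec


if __name__ == "__main__":
    params = 1e12  # 1T parameters
    for bits in (4, 2):
        size = weight_footprint_gib(params, bits)
        spare = headroom_gib(params, bits, nodes=4, mem_per_node_gib=128)
        print(f"{bits}-bit: weights {size:.1f} GiB, headroom {spare:.1f} GiB")

    # Hypothetical 2,000-token answer at the quoted 8-9 tk/sec.
    for rate in (8, 9):
        print(f"{rate} tk/sec: {generation_seconds(2000, rate):.0f} s")
```

Under those optimistic assumptions the 4-bit weights leave under 50 GiB across the whole cluster for everything else, while the 2-bit quant leaves more than half the memory free, which may be why the amd.com post went with 2-bit.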