sosodev 5 hours ago

I don’t have any particular model in mind, sorry. My numbers are just rough estimates based on my experience with a single-node setup. You might need to opt for a 2- or 3-bit model to get the full context window. The KV cache memory consumption, as well as overall performance, will depend heavily on the model’s architecture. Performance will also depend a lot on the inference server you choose and its configuration. I suspect a sub-agent running a much smaller model would be the ideal way to get the latest knowledge via web search and summarization.
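To make the 2- or 3-bit point concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone take at different quantization levels. The helper name `weight_memory_gb` is made up for illustration, and it ignores the KV cache, activations, and quantization overhead (scales, zero points), so treat the output as a lower bound rather than a real sizing.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    per-group scales or other quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    one_trillion = 1e12  # "trillion param" model from the discussion
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: {weight_memory_gb(one_trillion, bits):,.0f} GB")
```

Whatever memory is left after the weights is what you have for the KV cache, which is why a lower bit width can be the difference between a truncated and a full context window.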

I’m not trying to say that this would be a great experience or really compete with just buying a subscription to the top models. Rather, I just wanted to point out that $300k is an absurd estimate for a trillion-param model meant for personal use.

nijave 5 hours ago | parent [-]

I imagine a smaller single-node model would give a significantly better experience at significantly lower cost. When I was poking around with infra estimates, it seemed the main issue/cost came once you crossed from single-node to multi-node. You need _a lot_ of bandwidth if the weights are sharded. Like Tbps of bandwidth. The closest reasonable thing I've heard of for local multi-node is exo on macOS using a Thunderbolt interconnect.

sosodev 2 hours ago | parent [-]

I think it really just depends on your goals. Low tokens per second is fine for some people if the setup costs a fraction of a single-node one that can run a trillion-param model. If you’re actually running a small business and want multiple users to get a good experience in parallel, then yeah, I think you need a single node. At that point you can afford it, I suppose.

I don’t know what the scaling for multiple Strix Halo boards looks like in practice. From what I understand, each server has to process its part of the model in series, meaning server A holds 1/4 of the weights and sends its results to server B to process, and so on. So you don’t get compute scaling, just memory scaling.
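A toy model of that serial hand-off, as a sketch only. It assumes single-stream decoding is memory-bandwidth-bound, the model is dense (every weight is read once per token), and the weights are split evenly across nodes. The function names and all the numbers are hypothetical, not measurements of any real board.

```python
def pipeline_tokens_per_sec(weight_gb: float, num_nodes: int,
                            node_bandwidth_gb_s: float,
                            hop_latency_s: float = 0.0) -> float:
    """Estimated single-stream decode speed for a serial pipeline.

    Each node reads its 1/num_nodes share of the weights, then passes
    activations to the next node. Stages run one after another, so the
    per-token time is the SUM of the stage times, not the max.
    """
    per_stage_s = (weight_gb / num_nodes) / node_bandwidth_gb_s
    total_s = num_nodes * per_stage_s + (num_nodes - 1) * hop_latency_s
    return 1.0 / total_s


def max_model_gb(num_nodes: int, node_memory_gb: float) -> float:
    """Total memory available for weights grows linearly with node count."""
    return num_nodes * node_memory_gb


if __name__ == "__main__":
    # Hypothetical numbers: 100 GB of weights, 200 GB/s memory bandwidth
    # and 128 GB of memory per node, 1 ms per network hop.
    for n in (1, 2, 4):
        tps = pipeline_tokens_per_sec(100, n, 200, hop_latency_s=0.001)
        print(f"{n} node(s): {tps:.2f} tok/s, fits up to {max_model_gb(n, 128):.0f} GB")
```

Under these assumptions, adding nodes raises the size of model you can fit but leaves tokens per second for a single request roughly flat, or slightly worse once hop latency is counted. Batching several requests so that different stages stay busy at once would change the picture, which this sketch does not model.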