Remix.run Logo
sosodev 2 hours ago

I think it really just depends on your goals. Slow tokens per second is fine by some people if they cost a fraction of a single node setup that can run a trillion param model. If you’re actually running a small business and want to have multiple users getting a good experience in parallel then yeah I think you need a single node. At that point you can afford it I suppose.

I don’t know what the scaling for multiple strix halo boards looks like in practice. From what I understand each server has to process the model in serial. Meaning server A has 1/4 the weights and sends server B the results to process and so on. So you don’t get compute scaling just memory scaling.