mwcampbell 10 hours ago

Given that it's a 400B-parameter model, but a sparse MoE with only 13B active parameters per token, would it run well on an NVIDIA DGX Spark with 128 GB of unified RAM, or do you practically need to hold the full model in RAM even with sparse MoE?

timschmidt 10 hours ago | parent | next [-]

Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token. So RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed to the SSD interface. At least it's read-only, and not read-write, but even the fastest of SSDs will be significantly slower than RAM.

That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.
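
To see why that bandwidth matters, here's a rough back-of-envelope sketch in Python. The bandwidth figures, and the pessimistic assumption that all active weights get re-fetched every token, are illustrative, not measurements:

    # Worst-case per-token latency if the active experts must cross a link
    # each token. Bandwidth numbers are rough assumptions for illustration.
    links = {
        "RAM -> VRAM (PCIe 4.0 x16)": 25.0,  # GB/s, assumed
        "SSD -> RAM (fast NVMe)": 7.0,       # GB/s, assumed
    }

    active_gb = 13e9 * 0.5 / 1e9  # 13B active params at 4 bits/weight = 6.5 GB

    for name, gb_per_s in links.items():
        print(f"{name}: ~{active_gb / gb_per_s:.2f} s/token worst case")

Even the optimistic PCIe case comes out under 4 tokens/s, which is why keeping experts resident matters so much.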

nick49488171 3 hours ago | parent [-]

With a non-sequential generative approach, perhaps the RAM cache misses could be grouped together and the experts swapped in on a when-available/when-needed prioritized basis.
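
A purely hypothetical sketch of what that grouping could look like (none of these names come from a real inference library): queue expert-load requests with a priority, then service the most urgent ones as one batched transfer instead of many small ones.

    import heapq

    # Hypothetical: batch pending expert loads so several cache misses
    # are serviced in one grouped swap, most urgent first.
    pending = []  # min-heap of (tokens_until_needed, expert_id)

    def request_expert(expert_id, tokens_until_needed):
        heapq.heappush(pending, (tokens_until_needed, expert_id))

    def drain_batch(max_experts=4):
        """Pop up to max_experts of the most urgent requests as one swap."""
        batch = []
        while pending and len(batch) < max_experts:
            batch.append(heapq.heappop(pending)[1])
        return batch

    request_expert(7, 1)    # needed for the next token
    request_expert(42, 5)   # speculative "when available" prefetch
    print(drain_batch())    # -> [7, 42]: one transfer instead of two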

antirez 10 hours ago | parent | prev | next [-]

It can run with mmap(), but it's slower. 4-bit quantized, there's a decent ratio between the model size and the RAM, so with a fast SSD one could try it and see how it works. However, when a model is 4-bit quantized there's often the doubt that it's no better than an 8-bit quantized model of 200B parameters; it depends on the model, on the use case, ... Unfortunately the road to local inference of SOTA models is being blocked by RAM prices and the GPU demand from the big companies, leaving us with little. Probably today the best bet is to buy Mac Studio systems and then run distributed inference (MLX supports this, for instance), or a 512 GB Mac Studio M4 that costs something like $13k.
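
The size arithmetic behind that quantization tradeoff, as a quick sketch (weights only; a real deployment also needs room for the KV cache and activations):

    def weights_gb(params_billions, bits_per_weight):
        """Approximate storage for the weights alone, in GB."""
        return params_billions * bits_per_weight / 8  # 1B params @ 8-bit = 1 GB

    print(f"400B @ 4-bit: ~{weights_gb(400, 4):.0f} GB")  # ~200 GB
    print(f"200B @ 8-bit: ~{weights_gb(200, 8):.0f} GB")  # ~200 GB
    # Same footprint either way: fits a 512 GB Mac Studio, not a 128 GB box.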

vardump 2 hours ago | parent | next [-]

I think the 512 GB Mac Studio was the M3 Ultra.

Anyways, isn't a new Mac Studio due in a few months? It should be significantly faster as well.

I just hope RAM prices don't ruin this...

notpublic 9 hours ago | parent | prev [-]

Speaking of RAM prices, you can still get a Framework Desktop with the Max+ 395 and 128GB RAM for ~$2,459 USD. They have not increased the price for it yet.

https://frame.work/products/desktop-diy-amd-aimax300/configu...

Scipio_Afri 9 hours ago | parent [-]

Pretty sure those used to be $1,999 ... but not entirely sure

notpublic 9 hours ago | parent [-]

Yep. You're right. Looks like they increased it earlier this month. Bummer!

jychang 7 hours ago | parent | prev [-]

No.

128GB of VRAM gets you enough space for a ~256B-parameter model at 4-bit quantization. But 400B is too big for the DGX Spark, unless you connect two of them together and use tensor parallelism.
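
The arithmetic behind that, as a sketch assuming 4-bit quantized weights (the quantization level is my assumption; jychang doesn't state it):

    def fits(params_billions, memory_gb, bits_per_weight=4):
        """Do the quantized weights alone fit? (1B params @ 8-bit = 1 GB)"""
        return params_billions * bits_per_weight / 8 <= memory_gb

    print(fits(256, 128))      # True: exactly 128 GB, zero headroom
    print(fits(400, 128))      # False: 200 GB of weights
    print(fits(400, 2 * 128))  # True across two Sparks with tensor parallelism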