bigyabai 3 hours ago

You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse: only a subset of experts is activated for each token while the prompt is being processed. It's more of a PCIe bottleneck than a CPU one.
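To make the sparsity claim concrete, here's a minimal sketch of top-k MoE routing: the gate scores every expert, but only k of them are actually evaluated per token. All names and sizes here are illustrative, not taken from any specific model.

```python
import numpy as np

def moe_route(hidden, gate_w, k=2):
    """Top-k MoE routing: score all experts, but select only k to run."""
    scores = hidden @ gate_w              # one score per expert
    topk = np.argsort(scores)[-k:]        # indices of the k active experts
    weights = np.exp(scores[topk] - scores[topk].max())
    weights /= weights.sum()              # softmax over the selected experts
    return topk, weights

rng = np.random.default_rng(0)
n_experts, d = 64, 16                     # illustrative dimensions
gate_w = rng.standard_normal((d, n_experts))
hidden = rng.standard_normal(d)

active, w = moe_route(hidden, gate_w, k=4)
print(len(active), n_experts)             # only 4 of 64 experts touched
```

Only the selected experts' FFN weights are needed for this token, which is why per-expert placement (and the traffic moving those weights) matters more than raw CPU throughput.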

> And I don’t consider running a 1-bit superquant to be a valid thing here either.

I don't either. MXFP4 is scalar.

coder543 3 hours ago | parent [-]

Yes, you can offload random experts to the GPU, but the model will still activate experts that remain on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.

You're better off prioritizing offloading the KV cache and attention layers to the GPU than pinning a specific expert or two, but the performance loss I was talking about earlier still applies: you can't offload enough for a single 96GB GPU to deliver acceptable performance. You need multiple GPUs, or a Mac Studio.
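A back-of-envelope model shows why partial expert offload tanks throughput. Assume routing is roughly uniform, k experts fire per token, and a fraction f of the experts fit on the GPU; the cost ratio between a CPU and a GPU expert pass (20:1 here) is a made-up illustrative number, not a benchmark.

```python
def expected_token_cost(f_gpu, k=8, gpu_cost=1.0, cpu_cost=20.0):
    """Expected per-token expert cost when a fraction f_gpu of experts
    are GPU-resident and routing picks k experts roughly uniformly.

    All cost constants are hypothetical, for illustration only.
    """
    hits_gpu = k * f_gpu          # expected activations served by the GPU
    hits_cpu = k * (1 - f_gpu)    # expected activations falling to the CPU
    return hits_gpu * gpu_cost + hits_cpu * cpu_cost

base = expected_token_cost(1.0)   # all experts on GPU
half = expected_token_cost(0.5)   # half the experts offloaded
print(half / base)                # ~10x slower per token
```

Even with half the expert weights on the GPU, the expected per-token cost is dominated by the CPU hits, which is why offloading the (always-used) KV cache and attention layers first buys more than chasing individual experts.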

If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
