FWIW, that's an 80GB model, and you also need KV cache. You'd need 96GB-ish to run it on the GPU.

Do you know if it's doing what was described earlier, when I run it with all layers on GPU - paging an expert in every time the expert changes? Each expert is only 5.1B parameters.

EnPissant 4 days ago | parent | next [-]

It makes absolutely no sense to do what OP described. The decode stage is bottlenecked on memory bandwidth. Once you pull the weights from system RAM, your work is almost done. To then send gigabytes of weights PER TOKEN over PCIe to do some trivial computation on the GPU is crazy.

What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.

I'm guessing lmstudio gracefully falls back to running _something_ on the CPU. Hopefully you are running only the MoE layers on the CPU. I've only ever used llama.cpp.
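A rough back-of-envelope sketch of the bandwidth argument above. Only the 5.1B active-parameter figure comes from this thread; the bytes-per-parameter value and all three bandwidth numbers are illustrative assumptions, not measurements, and the estimate treats moving weights as the only per-token cost.

```python
# Back-of-envelope decode estimate. Hardware numbers below are ASSUMPTIONS.
ACTIVE_PARAMS = 5.1e9      # active parameters per token (figure from the thread)
BYTES_PER_PARAM = 0.5      # assumption: roughly 4-bit weights

# Assumed, illustrative bandwidths in bytes per second.
BANDWIDTH = {
    "system_ram": 80e9,    # CPU reading weights from system RAM
    "pcie": 32e9,          # shipping weights to the GPU over PCIe
    "gpu_vram": 900e9,     # weights already resident in VRAM
}

def seconds_per_token(bytes_moved: float, bandwidth: float) -> float:
    """Lower bound on per-token time if moving weights is the only cost."""
    return bytes_moved / bandwidth

# Worst case: every active weight is moved once per generated token.
bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM

estimates = {
    name: seconds_per_token(bytes_per_token, bw)
    for name, bw in BANDWIDTH.items()
}

for name, t in estimates.items():
    print(f"{name}: {t * 1e3:.1f} ms/token, at most ~{1 / t:.0f} tokens/s")
```

Under these assumed numbers the PCIe path is the slowest of the three, and it still has to read the weights out of system RAM first, which is the point being made: if the weights have to come from system RAM anyway, computing on the CPU avoids the extra hop.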

furyofantares 4 days ago | parent [-]

I tried a few things and checked CPU usage in Task Manager to see how much work the CPU is doing.

KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.

KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.

KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.

I believe you that it doesn't make sense to do it this way, it is slower, but it doesn't appear to be doing much of anything on the CPU.

You say gigabytes of weights PER TOKEN, is that true? I think an expert is about 2 GB, so a new expert is 2 GB, sure - but I might have all the experts for the token already in memory, no?

EnPissant 4 days ago | parent [-]

gpt-oss-120b chooses 4 experts per token and combines them.

I don't know how lmstudio works. I only know the fundamentals. There is no way it's sending experts to the GPU per token. Also, the CPU doesn't have much work to do. It's mostly waiting on memory.
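To make "chooses 4 experts per token and combines them" concrete, here is a minimal sketch of generic top-k expert routing. It is not taken from gpt-oss or lmstudio code: the function names, the toy experts, and the choice to softmax over only the selected scores are assumptions for illustration.

```python
import math

def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)          # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=4):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)                          # only k experts are run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy usage: 8 hypothetical experts, each one just scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5]

chosen = route_top_k(logits, k=4)
result = moe_layer([1.0, 2.0], experts, logits, k=4)
print(chosen)
print(result)
```

The router scores change with every token, so the set of selected experts can change with every token too. That is why only the weights of the chosen experts need to be read for a given token, and why the question of where those weights live matters so much.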

furyofantares 4 days ago | parent [-]

> There is no way it's sending experts to the GPU per token.

Right, it seems like either experts are stable across sequential tokens fairly often, or there's more than 4 experts in memory and it's stable within the in-memory experts for sequential tokens fairly often, like the poster said.

furyofantares 4 days ago | parent | prev [-]

^ Er, misspoke: each expert is at most 0.9B parameters, and there are 128 experts. 5.1B is the number of active parameters (4 experts + some other parameters).
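The corrected figures can be sanity-checked with simple arithmetic. This sketch uses only the numbers quoted in this comment; since 0.9B is stated as an upper bound per expert, the totals are bounds rather than exact counts.

```python
# Figures quoted in the comment above.
EXPERT_PARAMS_MAX = 0.9e9   # "at most 0.9B" parameters per expert
NUM_EXPERTS = 128
EXPERTS_PER_TOKEN = 4
ACTIVE_PARAMS = 5.1e9       # active parameters per token

# Upper bound on parameters held in experts overall.
all_expert_params = EXPERT_PARAMS_MAX * NUM_EXPERTS

# Upper bound on expert parameters touched for one token.
active_expert_params = EXPERT_PARAMS_MAX * EXPERTS_PER_TOKEN

# What is left over for the "other" always-active parameters (a lower bound).
other_active_params = ACTIVE_PARAMS - active_expert_params

print(f"all experts:      <= {all_expert_params / 1e9:.1f}B")
print(f"4 active experts: <= {active_expert_params / 1e9:.1f}B")
print(f"other active:     >= {other_active_params / 1e9:.1f}B")
```

So the experts account for the bulk of the model, while any single token only touches a small fraction of them.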