ThatPlayer 4 days ago

> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option, which offloads the MoE expert layers to the CPU? I can run gpt-oss-120b (5.1B active parameters) on my 4080 and get a usable ~20 tk/s. I had to upgrade my system RAM, but that's a much cheaper upgrade. https://github.com/ggml-org/llama.cpp/discussions/15396 has some notes on getting it running.
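For reference, my launch command is roughly along these lines (the model filename, the --n-cpu-moe count, and the context size are placeholders, not my exact settings; you tune --n-cpu-moe down until VRAM is nearly full):

    # sketch: keep MoE expert weights in system RAM, everything else on the GPU
    # (filename and the numbers below are illustrative only)
    ./llama-server \
      -m ./gpt-oss-120b-mxfp4.gguf \
      --n-gpu-layers 99 \
      --n-cpu-moe 30 \
      --ctx-size 16384

The reason it stays usable is that only a handful of experts are active per token, so the CPU/RAM side does far less work than with generic layer offloading.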

imiric 3 days ago | parent

I use Ollama, which offloads to the CPU automatically IIRC. IME the performance drops dramatically when that happens, and it hogs the CPU, making the system unresponsive for other tasks, so I try to avoid it.

ThatPlayer 3 days ago | parent

I don't believe that's the same thing. That's the generic offloading ollama does for any model that's too big to fit in VRAM, while this feature is specific to MoE models. https://github.com/ollama/ollama/issues/11772 is the feature request for something similar in ollama.

One comment in that thread mentions getting almost 30 tk/s from gpt-oss-120b on a 3090 with llama.cpp, compared to 8 tk/s with ollama.

This feature is limited to MoE models, but those seem to be gaining traction, with gpt-oss, glm-4.5, and qwen3 as recent examples.

imiric 3 days ago | parent

Ah, I was not aware of that, thanks. I'll give it a try.