ThatPlayer 4 days ago
> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload the MoE expert layers to the CPU? I can run gpt-oss-120b (5.1B active parameters) on my 4080 and get a usable ~20 tok/s. I had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running.
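As a rough sketch of the kind of command involved (the model filename and the layer count here are placeholders, not from the linked discussion; tune --n-cpu-moe until the non-expert weights fit in your VRAM):

    # Offload everything to the GPU by default (-ngl 99), but keep the MoE
    # expert weights of the first 30 layers in system RAM (--n-cpu-moe 30).
    # "gpt-oss-120b.gguf" and the value 30 are placeholders to adjust for
    # your own model file and VRAM.
    llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 30 -c 8192

The idea is that the small, frequently used dense/attention weights stay on the GPU while the large, sparsely activated expert weights live in system RAM, which is why a 100B+ MoE model can still run at usable speeds on a single consumer card.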
imiric 3 days ago | parent
I use Ollama, which IIRC offloads to the CPU automatically. IME the performance drops dramatically when that happens, and it hogs the CPU, making the system unresponsive for other tasks, so I try to avoid it.