coder543 | 8 hours ago
Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
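A minimal sketch of the kind of invocation described above. The model filename, context size, and layer counts are assumptions for illustration, not from the thread; `--n-cpu-moe` and `--n-gpu-layers` are real llama-server flags:

```shell
# Run llama-server with MoE expert weights partially offloaded to CPU.
# --n-gpu-layers 99   : put all dense layers on the GPU
# --n-cpu-moe 20      : keep the MoE expert tensors of the first 20
#                       layers in system RAM (tune this until the rest
#                       of the model fits in your 16GB of VRAM)
llama-server \
  -m ./model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 20 \
  -c 8192
```

Raise or lower `--n-cpu-moe` until VRAM usage fits; since only a few experts are active per token, the CPU-resident experts cost much less throughput than offloading dense layers would.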
boppo1 | 4 hours ago | parent
I've been way out of the local game for a while now. What's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before, with bash scripts for prompts.