Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

▲

boppo1 4 hours ago | parent [-]

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

	▲	adrian_b 39 minutes ago \| parent [-]
		Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port. You can connect to that port with any browser, for chat. Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.