abhikul0 10 hours ago

Macs have unified memory, so 36GB is 36GB for everything: GPU and CPU.

zozbot234 10 hours ago | parent | next

CPU-MoE offloading still helps with mmap. It shouldn't hurt token-gen speed much on a Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.
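Something like this, assuming a recent llama.cpp build that has the MoE-offload flag (model path is a placeholder):

    # offload all layers to Metal except the MoE expert weights,
    # which stay on the CPU side and remain mmap-backed
    llama-server -m ./model.gguf -ngl 99 --cpu-moe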

abhikul0 9 hours ago | parent

I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM; not sure what's going on.
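One way to check what's actually happening (my suggestion, nothing from upthread): macOS's vmmap distinguishes file-backed (mmap'd) regions from anonymous allocations, so mmap'd weights should show up as "mapped file" rather than plain malloc'd memory:

    # mmap'd weights appear as "mapped file" regions in the process map
    vmmap $(pgrep llama-server) | grep -i "mapped file"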

zozbot234 9 hours ago | parent

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
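For reference, a minimal CPU-only run looks like this (stock llama.cpp flags; model path is a placeholder):

    # -ngl 0 keeps every layer on the CPU, so the weights stay mmap-backed
    llama-server -m ./model.gguf -ngl 0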

mhitza 10 hours ago | parent | prev

For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit: a 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage.
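Rough back-of-envelope, assuming ~4.5 bits per weight for a typical Q4_K quant:

    24e9 params × 4.5 bits/param ≈ 108e9 bits ≈ 13.5 GB of weights

plus KV cache and runtime overhead, which lines up with peaking around 18GB.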