| ▲ | zozbot234 7 hours ago | |||||||
CPU-MoE still helps with mmap. It should not hurt token-gen speed much on the Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck. | ||||||||
| ▲ | abhikul0 6 hours ago | |||||||
I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM. Not sure what's going on. | ||||||||