abhikul0 9 hours ago
I'll try to use that, but llama-server has mmap on by default, and the process's resident memory still grows to the full size of the model. Not sure what's going on.
zozbot234 9 hours ago | parent
Try running CPU-only inference to troubleshoot that: layers offloaded to the GPU will likely just ignore mmap.
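Worth noting why mmap can still look like it "takes up" the model's size in RAM: pages of a mapped file are faulted in on first access and counted in the process's RSS, even though they are file-backed page cache the kernel can evict under memory pressure, so tools like `top` show the full model size as resident. A minimal sketch of that behavior (assuming Linux/macOS; the 4 MiB dummy file stands in for model weights):

```python
import mmap
import os
import tempfile

# Create a small file standing in for the model weights (hypothetical).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (4 * 1024 * 1024))  # 4 MiB of dummy "weights"
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    # Read-only mapping, analogous to llama.cpp's mmap path: no read()
    # copy is made up front; each page faults in on first access and is
    # then counted in the process RSS as file-backed page cache.
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    total = 0
    for off in range(0, size, 4096):  # touch every page to fault it in
        total += mm[off]
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)

# Every touched byte was 'x'; after this loop, RSS has grown by roughly
# the file size, yet those pages remain evictable (unless mlock'd).
print(total == (size // 4096) * ord("x"))
```

This also suggests why GPU offload behaves differently: weights copied into VRAM are read out of the mapping (or loaded directly), so the mapped pages no longer need to stay resident on the host side.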