| ▲ | abhikul0 10 hours ago |
| I hope the other sizes are coming too (9B for me). Can't fit much context with this on a 36GB Mac. |
|
| ▲ | mhitza 10 hours ago | parent | next [-] |
| It's a MoE model and the A3B stands for 3 billion active parameters, like the recent Gemma 4. You can try offloading the experts to the CPU with llama.cpp (--cpu-moe), which should free up a fair amount of space for context at the cost of slower token generation. |
| |
| ▲ | abhikul0 10 hours ago | parent | next [-] | | Macs have unified memory, so 36GB is 36GB for everything: GPU and CPU. | | |
| ▲ | zozbot234 10 hours ago | parent | next [-] | | CPU-MoE still helps with mmap. Should not overly hurt token-gen speed on the Mac since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck. | | |
| ▲ | abhikul0 9 hours ago | parent [-] | | I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM. Not sure what's going on. | | |
| ▲ | zozbot234 9 hours ago | parent [-] | | Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap. |
|
| |
| ▲ | mhitza 10 hours ago | parent | prev [-] | | For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit, as the 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage. |
| |
| ▲ | dgb23 10 hours ago | parent | prev | next [-] | | Should I expect the same memory footprint from a model with N active parameters as from one with simply N total parameters? | | |
| ▲ | daemonologist 10 hours ago | parent | next [-] | | No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.) All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size. | |
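To put rough numbers on the point above, here is a back-of-the-envelope sketch. It assumes the 35B total / 3B active figures from this thread and a flat bits-per-weight value, and it counts weights only (no KV cache, no compute buffers). Real quantized files tend to come out somewhat larger, since some tensors are usually kept at higher precision.

```python
def weights_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return total_params * bits_per_weight / 8 / 2**30


TOTAL_PARAMS = 35e9   # all experts have to live somewhere (RAM, VRAM or mmap)
ACTIVE_PARAMS = 3e9   # roughly what is read per generated token ("A3B")

for bpw in (4.0, 4.5, 5.0):
    print(f"{bpw} bits/weight: ~{weights_gib(TOTAL_PARAMS, bpw):.1f} GiB of weights")

# Bandwidth per token scales with the active share, not with the total.
print(f"active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The footprint is set by the total parameter count; the active count only affects how much of it is touched per token, which is why inference is faster without the model being smaller.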
| ▲ | 10 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | pdyc 10 hours ago | parent | prev [-] | | I don't get it. Macs have unified memory, so how would offloading experts to the CPU help? | | |
| ▲ | bee_rider 10 hours ago | parent [-] | | I bet the poster just didn’t remember that important detail about Macs; it is kind of unusual from a normal computer point of view. I wonder though, do Macs have swap, and could unused experts be offloaded to swap? | | |
| ▲ | abhikul0 10 hours ago | parent [-] | | Of course the swap is there for fallback but I hate using it lol as I don't want to degrade SSD longevity. |
|
|
|
|
| ▲ | pdyc 10 hours ago | parent | prev [-] |
| Can you elaborate? You can use a quantized version; would context still be an issue with it? |
| |
| ▲ | abhikul0 10 hours ago | parent | next [-] | | A usable quant, Q5_K_M imo, takes up ~26GB[0], which leaves ~6-7GB for context and running other programs, which is not much. [0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil... | |
| ▲ | nickthegreek 10 hours ago | parent | prev [-] | | context is always an issue with local models and consumer hardware. | | |
| ▲ | pdyc 10 hours ago | parent [-] | | Correct, but it should be some ratio of model size: if the model size is x GB, max context would occupy x * some constant of RAM. For the quantized version, assuming it's 18GB for Q4, it should be able to support 64-128k on this Mac. | | |
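For what it's worth, with conventional attention the KV cache scales with context length and the attention shape (layers, KV heads, head dimension) rather than with the weight file size directly. A minimal sketch of that estimate follows. The layer and head counts are HYPOTHETICAL placeholders, not the real config of any model in this thread, and models that include layers without a conventional KV cache would need less than this suggests.

```python
def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elem: float) -> float:
    """K and V caches for standard attention: two tensors per layer."""
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30


# HYPOTHETICAL architecture values, for illustration only.
ARCH = dict(n_layers=40, n_kv_heads=4, head_dim=128)

F16 = 2.0        # bytes per element
Q8_0 = 34 / 32   # 32 int8 values plus one fp16 scale per block

for n_ctx in (65_536, 131_072):
    f16 = kv_cache_gib(n_ctx, bytes_per_elem=F16, **ARCH)
    q8 = kv_cache_gib(n_ctx, bytes_per_elem=Q8_0, **ARCH)
    print(f"n_ctx={n_ctx}: f16 ~{f16:.2f} GiB, q8_0 ~{q8:.2f} GiB")
```

So the constant depends on the architecture and the KV cache type, and two models with the same file size can need quite different amounts of memory for the same context.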
| ▲ | abhikul0 10 hours ago | parent [-] | | For the 9B model, I can use the full context with Q8_0 KV. This uses ~16GB while still leaving comfortable headroom. Output after I exit the llama-server command:
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 = 6262 + 4553 + 3329) + 0 |
llama_memory_breakdown_print: | - Host | 2779 = 666 + 0 + 2112 |
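A quick way to read that breakdown, assuming the columns line up as total = free + (self = model + context + compute) + unaccounted, which is how the equals and plus signs in the rows appear to be laid out. The numbers are copied from the output above; the dictionary keys are just illustrative names.

```python
# Values in MiB, copied from the llama_memory_breakdown_print rows above.
mtl0 = {"total": 28753, "free": 14607, "self": 14145,
        "model": 6262, "context": 4553, "compute": 3329, "unaccounted": 0}
host = {"self": 2779, "model": 666, "context": 0, "compute": 2112}


def self_from_parts(row: dict) -> int:
    """Sum the model, context and compute columns of one row."""
    return row["model"] + row["context"] + row["compute"]


for name, row in (("MTL0", mtl0), ("Host", host)):
    parts = self_from_parts(row)
    # Each column is rounded separately, so allow a small mismatch.
    print(f"{name}: parts={parts} MiB, reported self={row['self']} MiB")

used_mib = mtl0["self"] + host["self"]
print(f"llama.cpp total: {used_mib} MiB (~{used_mib / 1024:.1f} GiB)")
print(f"context share on GPU: {mtl0['context'] / mtl0['self']:.0%}")
```

Under that reading, the GPU and host "self" figures together land close to the ~16GB quoted in the comment, with the context buffer being a sizeable share of the GPU side.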
|
|
|
|