Can you elaborate? You could use a quantized version; would context still be an issue with it?

▲ abhikul0 9 hours ago | parent | next [-]

A usable quant, Q5_K_M imo, takes up ~26GB[0], which leaves ~6-7GB for context and for running other programs, which is not much.

[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...

▲ nickthegreek 9 hours ago | parent | prev [-]

context is always an issue with local models and consumer hardware.

▲ pdyc 9 hours ago | parent [-]

Correct, but it should be some ratio of model size: if the model size is x GB, max context would occupy x * some constant of RAM. For the quantized version, assuming it's 18GB for Q4, it should be able to support 64-128k with this Mac.
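
For reference, here is a rough sketch of how KV cache size is usually estimated for a plain attention-only transformer with grouped-query attention. The layer count, KV head count and head dimension below are hypothetical placeholders, not values from any model card, and hybrid architectures that keep a KV cache on only some layers will come out smaller. The 34/32 bytes per element for Q8_0 assumes the usual block layout of 32 int8 values plus one f16 scale.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Estimate KV cache size for a plain attention-only transformer."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions for illustration only (assumption, not a real model).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

F16 = 2.0       # bytes per element for an f16 cache
Q8_0 = 34 / 32  # 32 int8 values + one f16 scale per block

for n_ctx in (65536, 131072):
    for name, bpe in (("f16", F16), ("q8_0", Q8_0)):
        size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx, bpe)
        print(f"ctx={n_ctx:>6} kv={name:<4} ~{size / 2**30:.2f} GiB")
```

In this estimate the cache grows with context length and attention shape; the weight quantization level does not appear in it, only the cache type does.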

▲ abhikul0 8 hours ago | parent [-]

For the 9B model, I can use the full context with Q8_0 KV. This uses ~16GB in total, which still leaves comfortable headroom.

Output after I exit the llama-server command:

  llama_memory_breakdown_print: | memory breakdown [MiB]  | total    free     self   model   context   compute    unaccounted |
  llama_memory_breakdown_print: |   - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 =  6262 +    4553 +    3329) +           0 |
  llama_memory_breakdown_print: |   - Host                |                   2779 =   666 +       0 +    2112                |
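
As a quick check on the ~16GB figure, the breakdown above can be added up by hand. The sketch below just copies the model, context and compute columns from the two rows; the dictionary names are made up for illustration. The column sums land 1 MiB under the printed "self" values, presumably because each column is rounded separately.

```python
# Values copied from the llama_memory_breakdown_print output above (MiB).
gpu = {"model": 6262, "context": 4553, "compute": 3329}   # MTL0 row
host = {"model": 666, "context": 0, "compute": 2112}      # Host row

gpu_self = sum(gpu.values())
host_self = sum(host.values())
total_mib = gpu_self + host_self

print(f"GPU self:  {gpu_self} MiB")
print(f"Host self: {host_self} MiB")
print(f"Combined:  {total_mib} MiB = {total_mib / 1024:.1f} GiB")
```

The context column (4553 MiB) is the KV cache share; the rest is weights and compute buffers.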