pdyc | 7 hours ago
Correct, but it should be some ratio of the model size: if the model weighs x GB, the max context would occupy roughly x times some constant of RAM. For the quantized version, assuming it's 18 GB at Q4, this Mac should be able to support 64-128k context.
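The KV-cache footprint can be estimated directly from the model's architecture rather than a flat ratio of weight size. A minimal sketch, where the layer count, KV-head count, and head dimension are hypothetical numbers for a 9B-class model with grouped-query attention, and an 8-bit (Q8_0-like, ~1 byte per element) cache is assumed:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # 2 tensors (K and V) per layer, one vector of size
    # n_kv_heads * head_dim per token, for n_ctx tokens.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 2**30

# Hypothetical 9B-class model: 40 layers, 8 KV heads (GQA), head dim 128,
# 128k context, ~1 byte per element for an 8-bit quantized KV cache.
print(kv_cache_gib(40, 8, 128, 128 * 1024, 1))  # → 10.0 GiB
```

With grouped-query attention the cache grows with the (small) KV-head count rather than the full attention-head count, which is why a long context can fit in a modest amount of RAM on top of the weights.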
abhikul0 | 7 hours ago | parent
For the 9B model, I can use the full context with a Q8_0 KV cache. This uses around 16 GB while still leaving comfortable headroom. Output after exiting llama-server: