AlbinoDrought 4 hours ago
This model does not fit in 12 GB of VRAM - even the smallest quant is unlikely to fit. However, portions can be offloaded to regular RAM / CPU with a performance hit. I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality/speed tradeoff you're willing to accept on your hardware. The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
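For illustration, a starting invocation might look like the sketch below (the GGUF filename and layer count are placeholders, not the exact names from the guide; --n-gpu-layers controls how many layers are kept in VRAM, with the rest running from system RAM on the CPU):

  # Hypothetical filename; raise --n-gpu-layers until VRAM is nearly
  # full, or lower it if the server fails to allocate memory.
  llama-server \
    --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --n-gpu-layers 20 \
    --ctx-size 8192 \
    --port 8080

Each run prints how much memory the GPU layers take, so a few tries at different layer counts quickly finds the limit for a given card.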
zokier 2 hours ago
Thanks for the pointers! One more thing - that guide says:

> You can choose UD-Q4_K_XL or other quantized versions.

I see eight different 4-bit quants (I assume that's the size I want?). How do I pick which one to use?