AlbinoDrought 4 hours ago
This model does not fit in 12 GB of VRAM - even the smallest quant is unlikely to fit. However, portions can be offloaded to regular RAM / CPU with a performance hit. I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality/speed tradeoff you're willing to accept on your hardware. The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
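For illustration, a starting invocation might look like the sketch below (the GGUF filename and layer count are placeholders, not the exact names from the guide; --n-gpu-layers controls how many layers are kept in VRAM, with the rest running from system RAM on the CPU):

  # Hypothetical filename; raise --n-gpu-layers until VRAM is nearly
  # full, or lower it if the server fails to allocate memory.
  llama-server \
    --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
    --n-gpu-layers 20 \
    --ctx-size 8192 \
    --port 8080

Each run prints how much memory the GPU layers take, so a few tries at different layer counts quickly finds the limit for a given card.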
zokier 2 hours ago
Thanks for the pointers! One more thing - that guide says:

> You can choose UD-Q4_K_XL or other quantized versions.

I see eight different 4-bit quants (I assume that's the size I want?). How do I pick which one to use?