| ▲ | danielhanchen 8 hours ago |
| For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next |
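For anyone wiring up a local client against such a deployment, here is a minimal smoke test (an illustrative sketch, not from the guide: the default port 8080 is assumed, and the model name in the payload is arbitrary since llama-server serves whichever model it loaded):

    # Hypothetical sketch: llama-server exposes an OpenAI-compatible API,
    # so plain curl (or any OpenAI-style client) can talk to it.
    $ curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "qwen3-coder-next",
              "messages": [{"role": "user", "content": "Write hello world in C."}]
            }'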
|
| ▲ | genpfault 6 hours ago | parent | next [-] |
| Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64

llama.cpp command line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
        --ctx-size 32768
|
| |
| ▲ | halcyonblue 5 hours ago | parent [-] | | What am I missing here? I thought this model needed ~46GB of unified memory for the 4-bit quant, and the Radeon RX 7900 XTX only has 24GB, right? Hoping to get some insight, thanks in advance! | | |
| ▲ | coder543 5 hours ago | parent [-] | | MoEs can be efficiently split between the dense weights (attention, KV cache, etc.) and the sparse expert (MoE) weights. By keeping the dense weights on the GPU and offloading the sparse expert weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs (see the sketch below). Not as good as running the entire thing on the GPU, of course. |
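As an illustration of that split (a sketch under assumptions, not a verified recipe for this exact model): llama.cpp's --override-tensor (-ot) flag maps tensors matching a regex to a backend, and the usual ffn_*_exps naming of expert tensors in MoE GGUFs lets you pin just those to the CPU:

    # Hypothetical sketch: offload all layers to the GPU, then override
    # the MoE expert tensors (ffn_*_exps) back onto CPU RAM.
    $ ./llama-server \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --ctx-size 32768

    # Recent llama.cpp builds also have --cpu-moe / --n-cpu-moe N as a
    # shorthand for this kind of override.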
|
|
|
| ▲ | bityard 5 hours ago | parent | prev | next [-] |
| Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do. Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs? |
|
| ▲ | MrDrMcCoy an hour ago | parent | prev | next [-] |
| Still hoping IQuest-Coder gets the same treatment :) |
|
| ▲ | ranger_danger 7 hours ago | parent | prev | next [-] |
| What is the difference between the UD and non-UD files? |
| |
| ▲ | danielhanchen 7 hours ago | parent [-] | | UD stands for "Unsloth Dynamic", which upcasts important layers to higher bit widths. Non-UD files are just standard llama.cpp quants. Both still use our calibration dataset. | | |
| ▲ | CamperBob2 6 hours ago | parent | next [-] | | Please consider authoring a single, straightforward introductory-level page somewhere that explains what all the filename components mean, and who should use which variants. The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO. | | |
| ▲ | danielhanchen 6 hours ago | parent | next [-] | | Oh, good idea! UD-Q4_K_XL (Unsloth Dynamic 4-bit, Extra Large) is what I generally recommend for most hardware; MXFP4_MOE is also OK. | | |
| ▲ | Keats 5 hours ago | parent [-] | | Is there some indication of how the different bit quantizations affect performance? I.e., I have a 5090 + 96GB of RAM, so I want the best possible model, but I don't care about getting 2% better perf if I only get 5 tok/s. | | |
| ▲ | mirekrusin 3 hours ago | parent [-] | | It only takes the download time plus a minute to test speed yourself, so you can try different quants; it's hard to write down a table because once you spill out of the GPU it depends on your system (RAM clock, etc.). I guess it would make sense to publish something like the max context size / quants that fit fully on common configs: single GPU, dual GPU, unified RAM on Mac, etc. A llama-bench sketch is below. | | |
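For the speed side of that self-test, llama.cpp ships a llama-bench tool (a sketch with assumed local file names; -p and -n set the prompt and generation token counts):

    # Hypothetical sketch: compare generation speed of two downloaded quants.
    $ ./llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 512 -n 128
    $ ./llama-bench -m Qwen3-Coder-Next-Q8_0.gguf -p 512 -n 128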
| ▲ | Keats 3 hours ago | parent [-] | | Testing speed is easy, yes; I'm mostly wondering about the quality difference between, say, Q6 vs. Q8_K_XL. |
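For the quality side, one coarse do-it-yourself proxy (again a sketch, with the corpus file name assumed): llama.cpp's llama-perplexity tool scores a model on a text file, so you can compare quants of the same model head-to-head. Lower perplexity is closer to the unquantized baseline, though it only loosely tracks downstream coding quality:

    # Hypothetical sketch: perplexity of two quants on the same eval text.
    $ ./llama-perplexity -m Qwen3-Coder-Next-Q6_K.gguf -f wiki.test.raw
    $ ./llama-perplexity -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf -f wiki.test.raw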
|
|
| |
| ▲ | segmondy 6 hours ago | parent | prev [-] | | The green/yellow/red indicators are based on the hardware you've set in your Hugging Face settings. |
| |
| ▲ | ranger_danger an hour ago | parent | prev [-] | | What is your definition of "important" in this context? |
|
|
|
| ▲ | CamperBob2 3 hours ago | parent | prev | next [-] |
| Good results with your Q8_0 version on a 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tps. Thanks! Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page? |
|
| ▲ | binsquare 7 hours ago | parent | prev [-] |
| How did you do it so fast? Great work as always btw! |
| |