| ▲ | danielhanchen 8 hours ago |
| For those interested, made some Dynamic Unsloth GGUFs for local deployment at https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF and made a guide on using Claude Code / Codex locally: https://unsloth.ai/docs/models/qwen3-coder-next |
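For anyone wiring up a local client against such a deployment, here is a minimal smoke test (an illustrative sketch, not from the guide: the default port 8080 is assumed, and the model name in the payload is arbitrary since llama-server serves whichever model it loaded):

    # Hypothetical sketch: llama-server exposes an OpenAI-compatible API,
    # so plain curl (or any OpenAI-style client) can talk to it.
    $ curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "qwen3-coder-next",
              "messages": [{"role": "user", "content": "Write hello world in C."}]
            }'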
|
| ▲ | genpfault 6 hours ago | parent | next [-] |
| Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64

llama.cpp command line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
        --ctx-size 32768
|
| |
| ▲ | halcyonblue 5 hours ago | parent [-] | | What am I missing here? I thought this model needed ~46GB of unified memory for the 4-bit quant, and the Radeon RX 7900 XTX only has 24GB, right? Hoping to get some insight, thanks in advance! | | |
| ▲ | coder543 5 hours ago | parent [-] | | MoEs can be efficiently split between the dense weights (attention, KV cache, etc.) and the sparse expert (MoE) weights. By keeping the dense weights on the GPU and offloading the sparse expert weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs (see the sketch below). Not as good as running the entire thing on the GPU, of course. |
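As an illustration of that split (a sketch under assumptions, not a verified recipe for this exact model): llama.cpp's --override-tensor (-ot) flag maps tensors matching a regex to a backend, and the usual ffn_*_exps naming of expert tensors in MoE GGUFs lets you pin just those to the CPU:

    # Hypothetical sketch: offload all layers to the GPU, then override
    # the MoE expert tensors (ffn_*_exps) back onto CPU RAM.
    $ ./llama-server \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --n-gpu-layers 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --ctx-size 32768

    # Recent llama.cpp builds also have --cpu-moe / --n-cpu-moe N as a
    # shorthand for this kind of override.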
|
|
|
| ▲ | bityard 5 hours ago | parent | prev | next [-] |
| Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do. Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs? |
|
| ▲ | MrDrMcCoy an hour ago | parent | prev | next [-] |
| Still hoping IQuest-Coder gets the same treatment :) |
|
| ▲ | ranger_danger 7 hours ago | parent | prev | next [-] |
| What is the difference between the UD and non-UD files? |
| |
| ▲ | danielhanchen 7 hours ago | parent [-] | | UD stands for "Unsloth Dynamic", which upcasts important layers to higher bit widths. Non-UD files are just standard llama.cpp quants. Both still use our calibration dataset. | | |
| ▲ | CamperBob2 6 hours ago | parent | next [-] | | Please consider authoring a single, straightforward introductory-level page somewhere that explains what all the filename components mean, and who should use which variants. The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO. | | |
| ▲ | danielhanchen 6 hours ago | parent | next [-] | | Oh, good idea! UD-Q4_K_XL (Unsloth Dynamic 4-bit, Extra Large) is what I generally recommend for most hardware; MXFP4_MOE is also OK. | | |
| ▲ | Keats 5 hours ago | parent [-] | | Is there some indication of how the different bit quantizations affect performance? I.e., I have a 5090 + 96GB of RAM, so I want the best possible model, but I don't care about getting 2% better perf if I only get 5 tok/s. | | |
| ▲ | mirekrusin 3 hours ago | parent [-] | | It only takes the download time plus a minute to test speed yourself, so you can try different quants; it's hard to write down a table because once you spill out of the GPU it depends on your system (RAM clock, etc.). I guess it would make sense to publish something like the max context size / quants that fit fully on common configs: single GPU, dual GPU, unified RAM on Mac, etc. A llama-bench sketch is below. | | |
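For the speed side of that self-test, llama.cpp ships a llama-bench tool (a sketch with assumed local file names; -p and -n set the prompt and generation token counts):

    # Hypothetical sketch: compare generation speed of two downloaded quants.
    $ ./llama-bench -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -p 512 -n 128
    $ ./llama-bench -m Qwen3-Coder-Next-Q8_0.gguf -p 512 -n 128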
| ▲ | Keats 3 hours ago | parent [-] | | Testing speed is easy, yes; I'm mostly wondering about the quality difference between, say, Q6 vs. Q8_K_XL. |
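For the quality side, one coarse do-it-yourself proxy (again a sketch, with the corpus file name assumed): llama.cpp's llama-perplexity tool scores a model on a text file, so you can compare quants of the same model head-to-head. Lower perplexity is closer to the unquantized baseline, though it only loosely tracks downstream coding quality:

    # Hypothetical sketch: perplexity of two quants on the same eval text.
    $ ./llama-perplexity -m Qwen3-Coder-Next-Q6_K.gguf -f wiki.test.raw
    $ ./llama-perplexity -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf -f wiki.test.raw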
|
|
| |
| ▲ | segmondy 6 hours ago | parent | prev [-] | | The green/yellow/red indicators are based on the hardware you've set in your Hugging Face settings. |
| |
| ▲ | ranger_danger an hour ago | parent | prev [-] | | What is your definition of "important" in this context? |
|
|
|
| ▲ | CamperBob2 3 hours ago | parent | prev | next [-] |
| Good results with your Q8_0 version on a 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tps. Thanks! Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page? |
|
| ▲ | binsquare 7 hours ago | parent | prev [-] |
| How did you do it so fast? Great work as always btw! |
| |