Remix.run Logo
genpfault 6 hours ago

Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
    --ctx-size 32768
halcyonblue 5 hours ago | parent [-]

What am I missing here? I thought this model needs 46GB of unified memory for 4-bit quant. Radeon RX 7900 XTX has 24GB of memory right? Hoping to get some insight, thanks in advance!

coder543 5 hours ago | parent [-]

MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

Not as good as running the entire thing on the GPU, of course.