(Note UPDATED config)
Ya, if you are using the CPU it may slowdown quick.
This may be a bit huge and overcomplicated, on this host I am running it on a AMD Ryzen 7 5700G so that I can use the APU to dedicate the 3090.
podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 \
--ctx-size 131072 \
--no-mmproj-offload \
--no-context-shift \
--kv-unified \
--spec-type draft-mtp \
--spec-draft-n-max 6 \
--spec-draft-p-min 0.75 \
-fa on --jinja --no-mmap \
--cache-ram -1 \
--no-warmup -np 1 \
-n 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--min-p 0.00 \
--top-k 20 \
--top-p 0.95 \
--presence-penalty 0.0 \
--repeat-penalty 1.05 \
--fit off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--prio 3 \
--poll 100 \
--port 8080 \
--host 0.0.0.0
I am just building the container with: podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
And here is the logs from a 'make me a flappy bird program in python' webui prompt. prompt eval time = 105.86 ms / 19 tokens ( 5.57 ms per token, 179.47 tokens per second)
eval time = 100549.41 ms / 4608 tokens ( 21.82 ms per token, 45.83 tokens per second)
total time = 100655.28 ms / 4627 tokens
draft acceptance rate = 0.47215 ( 3408 accepted / 7218 generated)
I am down to ~25.54 t/s with a 95% full context.