This feels a bit pessimistic. Qwen 3.5 35B-A3B runs at 38 t/s tg with llama.cpp (mmap enabled) on my Radeon 6800 XT.
At what quantization and with what size context window?
Looks like it's a bit slower today. Running llama.cpp b8192 Vulkan.
$ ./llama-cli -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"
[snip 73 lines]
[ Prompt: 86.6 t/s | Generation: 34.8 t/s ]
$ ./llama-cli -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"
[snip 128 lines]
[ Prompt: 78.3 t/s | Generation: 30.9 t/s ]
I suspect a ROCm build would be faster, but it doesn't work out of the box for me.
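For what it's worth, timing `llama-cli` runs by hand is noisy; llama.cpp ships a `llama-bench` tool that averages over repeated runs and reports prompt-processing (pp) and token-generation (tg) t/s separately. A rough sketch (the model path here matches the one above and is otherwise an assumption; adjust `-ngl` for how many layers fit in VRAM):

```shell
# Sketch, not a verified invocation: benchmark 512-token prompt processing
# and 128-token generation, offloading all layers to the GPU.
./llama-bench \
  -m unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -p 512 -n 128 -ngl 99
```

That should make the Vulkan-vs-ROCm comparison cleaner once the ROCm build works.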