bigyabai 4 days ago
I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.
EnPissant 4 days ago
For contrast, here is what I get on an RTX 5090 with the 30B Qwen3 Coder quantized to ~4 bits:

- Prompt processing (65k tokens): 4818 tokens/s
- Token generation (8k tokens): 221 tokens/s

If I offload just the experts to run on the CPU, I get:

- Prompt processing (65k tokens): 3039 tokens/s
- Token generation (8k tokens): 42.85 tokens/s

As you can see, token generation is over 5x slower. This setup uses only ~5.5GB of VRAM, so token generation could be sped up a small amount by moving a few of the experts back onto the GPU.
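To sanity-check the "over 5x" figure, the slowdown can be computed directly from the throughput numbers quoted above. This is only a rough sketch of the arithmetic; the names `full_gpu`, `experts_on_cpu`, and `slowdown` are made up for illustration and do not come from any benchmarking tool.

```python
# Throughput figures quoted in the comment above, in tokens/s.
full_gpu = {"prompt_processing": 4818.0, "token_generation": 221.0}
experts_on_cpu = {"prompt_processing": 3039.0, "token_generation": 42.85}


def slowdown(before: float, after: float) -> float:
    """Return how many times slower `after` is compared with `before`."""
    return before / after


# Compare each phase with everything on the GPU vs. experts offloaded to the CPU.
ratios = {phase: slowdown(full_gpu[phase], experts_on_cpu[phase]) for phase in full_gpu}

for phase, ratio in ratios.items():
    print(f"{phase}: {ratio:.2f}x slower")
```

On these numbers, prompt processing slows down by roughly 1.6x while token generation slows down by roughly 5.2x, so the offload hurts generation far more than prompt processing.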