Remix clone Hacker News


	▲	bigyabai 4 days ago
		I run the 30B Qwen3 on my 8GB Nvidia GPU and get a shockingly high tok/s.
	▲	EnPissant 4 days ago
		For contrast, here is what I get on an RTX 5090 with Qwen3 Coder 30B quantized to ~4 bits:
		- Prompt processing, 65k tokens: 4818 tokens/s
		- Token generation, 8k tokens: 221 tokens/s
		If I offload just the experts to run on the CPU, I get:
		- Prompt processing, 65k tokens: 3039 tokens/s
		- Token generation, 8k tokens: 42.85 tokens/s
		As you can see, token generation is over 5x slower. The offloaded setup uses only ~5.5GB of VRAM, so token generation could be sped up a small amount by moving a few of the experts back onto the GPU.
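To put those numbers side by side, here is a rough sketch of the arithmetic in Python. It only uses the throughput figures quoted above; the function names are made up for illustration, and treating "65k" and "8k" as exactly 65,000 and 8,000 tokens is an assumption.

```python
# Throughput figures quoted in the comment above (tokens per second).
GPU_ONLY = {"prompt": 4818.0, "generation": 221.0}
EXPERTS_ON_CPU = {"prompt": 3039.0, "generation": 42.85}

# Assumption: "65k" and "8k" are read as round numbers of tokens.
PROMPT_TOKENS = 65_000
GENERATED_TOKENS = 8_000


def slowdown(baseline_tps: float, offloaded_tps: float) -> float:
    """How many times slower the offloaded run is than the baseline."""
    return baseline_tps / offloaded_tps


def total_seconds(prompt_tokens: int, generated_tokens: int, rates: dict) -> float:
    """Wall-clock estimate: prompt processing time plus generation time."""
    return prompt_tokens / rates["prompt"] + generated_tokens / rates["generation"]


prompt_slowdown = slowdown(GPU_ONLY["prompt"], EXPERTS_ON_CPU["prompt"])
generation_slowdown = slowdown(GPU_ONLY["generation"], EXPERTS_ON_CPU["generation"])

print(f"prompt processing slowdown: {prompt_slowdown:.2f}x")
print(f"token generation slowdown:  {generation_slowdown:.2f}x")
print(f"GPU only, end to end:       {total_seconds(PROMPT_TOKENS, GENERATED_TOKENS, GPU_ONLY):.1f} s")
print(f"experts on CPU, end to end: {total_seconds(PROMPT_TOKENS, GENERATED_TOKENS, EXPERTS_ON_CPU):.1f} s")
```

Under those assumptions, prompt processing takes a much smaller hit than token generation, which is where the "over 5x" figure comes from.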