	▲	magicalhippo 10 hours ago
		For reference in case it's interesting to someone, a 5090 on Windows 11 with CUDA 13.1:

		| model                 |      size |  params | backend | ngl |   test |              t/s |
		| --------------------- | --------: | ------: | ------- | --: | -----: | ---------------: |
		| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA    | 999 | pp2048 | 10179.12 ± 52.86 |
		| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA    | 999 |  tg128 |    326.82 ± 7.82 |
		| qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | CUDA    | 999 | pp2048 |   3129.92 ± 5.12 |
		| qwen35 27B Q6_K       | 23.87 GiB | 26.90 B | CUDA    | 999 |  tg128 |     53.45 ± 0.15 |

		build: 9d34231bb (8929)

		gpt-oss-20b-MXFP4.gguf
		Qwen3.6-27B-UD-Q6_K_XL.gguf

		Using MXFP4 of GPT-OSS because it was trained quantization-aware for this quantization type, and it's native to the 50xx.
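To put the t/s figures above in wall-clock terms, here is a rough sketch that turns them into an estimated time for one request. It assumes the prompt-processing (pp) and token-generation (tg) rates stay constant and ignores model load time, sampling overhead, and the slowdown at larger context depths, so treat the results as ballpark numbers only. The function and dictionary names are made up for illustration.

```python
def estimate_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Rough request time: prompt processing plus generation, no overheads."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Mean t/s values copied from the benchmark table above (5090, CUDA backend).
RESULTS = {
    "gpt-oss 20B MXFP4 MoE": {"pp": 10179.12, "tg": 326.82},
    "qwen35 27B Q6_K": {"pp": 3129.92, "tg": 53.45},
}


def summarize(prompt_tokens=2048, gen_tokens=128):
    """Estimated seconds per model for the given prompt and output sizes."""
    return {
        name: estimate_seconds(prompt_tokens, gen_tokens, r["pp"], r["tg"])
        for name, r in RESULTS.items()
    }


if __name__ == "__main__":
    # Same shape as the benchmark: 2048-token prompt, 128 generated tokens.
    for name, secs in summarize().items():
        print(f"{name}: ~{secs:.2f} s")
```

For the benchmark's own shape (2048 prompt tokens, 128 generated), this works out to roughly 0.6 s for gpt-oss 20B and roughly 3 s for the 27B model, with generation dominating in both cases.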
	▲	ycui7 2 hours ago
		You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 using an AutoRound quant with MTP enabled. It runs faster than Sonnet API calls. A 5090 gets maybe 100 TPS with MTP.