| ▲ | magicalhippo 10 hours ago | |
For reference in case it's interesting to someone, a 5090 on Windows 11 with CUDA 13.1
I'm using the MXFP4 version of GPT-OSS because the model was trained quantization-aware for that format, and the 50xx series supports it natively. | ||
| ▲ | ycui7 2 hours ago | parent [-] | |
You can get 120 TPS (144 peak) with Qwen3.6-27B on an RTX PRO 6000 with AutoRound when MTP is enabled. It runs faster than Sonnet API calls. A 5090 gets maybe 100 TPS with MTP. | ||