mendeza | 3 days ago
What is the throughput for gpt-oss? 1 token every 2 seconds is really slow, but understandable since you are moving the cache to disk.
anuarsh | 3 days ago
1 tok/2s is the best I got on my PC, thanks to the MoE architecture of qwen3-next-80B. gpt-oss-20B is slower because I load all of a layer's experts to the GPU and unpack the weights (4-bit -> bf16) each time, whereas with qwen3-next I load only the active experts (normally ~150 out of 512). I could probably do the same with gpt-oss.
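To make the difference concrete, here is a minimal sketch (not anuarsh's actual code) of the selective-expert-loading idea: keep every expert 4-bit-packed on the CPU, and only the experts the router actually activates for the current batch get dequantized to bf16 and moved to the GPU. The `ExpertStore` class, the nibble-packing scheme, and the sizes (`NUM_EXPERTS`, `TOP_K`, `HIDDEN`) are illustrative assumptions, not the project's real format.

```python
import torch

NUM_EXPERTS = 512   # total experts per MoE layer (qwen3-next-style; illustrative)
TOP_K = 10          # experts the router activates per token (assumption)
HIDDEN = 256        # toy hidden size

class ExpertStore:
    """Holds 4-bit-packed expert weights on CPU; unpacks only on demand."""
    def __init__(self):
        # Two int4 values packed per byte, plus a per-expert dequant scale.
        self.packed = torch.randint(
            0, 256, (NUM_EXPERTS, HIDDEN * HIDDEN // 2), dtype=torch.uint8)
        self.scale = torch.rand(NUM_EXPERTS) * 0.01

    def load_expert(self, idx, device):
        """Unpack one expert 4-bit -> bf16 and move it to `device`."""
        b = self.packed[idx]
        lo = (b & 0x0F).to(torch.int8) - 8   # low nibble, centered around 0
        hi = (b >> 4).to(torch.int8) - 8     # high nibble
        w = torch.stack((lo, hi), dim=-1).reshape(HIDDEN, HIDDEN)
        return (w.to(torch.bfloat16) * self.scale[idx].bfloat16()).to(device)

def moe_forward(x, router_logits, store, device):
    """Run a token batch through only the experts the router selected."""
    topk = router_logits.topk(TOP_K, dim=-1)
    gates = torch.softmax(topk.values, dim=-1)        # (batch, TOP_K)
    out = torch.zeros_like(x)
    # The union of active experts across the batch is usually a small
    # fraction of NUM_EXPERTS -- that is what makes selective loading win.
    for e in topk.indices.unique().tolist():
        w_e = store.load_expert(e, device)            # dequantize just this one
        mask = (topk.indices == e)                    # (batch, TOP_K)
        g = (gates * mask).sum(-1, keepdim=True).to(x.dtype)
        out += g * (x @ w_e.T)                        # expert as a plain matmul
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
store = ExpertStore()
x = torch.randn(4, HIDDEN, dtype=torch.bfloat16, device=device)
router_logits = torch.randn(4, NUM_EXPERTS, device=device)
print(moe_forward(x, router_logits, store, device).shape)  # torch.Size([4, 256])
```

Since the union of active experts across a batch is a small fraction of the total, the per-step unpack cost scales with roughly TOP_K experts instead of all 512, which is why qwen3-next can come out faster here despite being the larger model.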