mendeza 3 days ago

What is the throughput for gpt-oss? 1 token every 2 seconds is really slow, though understandable since you are moving the cache to disk.

anuarsh 3 days ago | parent [-]

1 tok / 2 s is the best I got on my PC, thanks to the MoE architecture of qwen3-next-80B. gpt-oss-20B is slower because I load all experts of a single layer to the GPU and unpack the weights (4-bit -> bf16) each time, while with qwen3-next I load only the active experts (typically ~150 out of 512). I could probably do the same with gpt-oss.
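
A minimal sketch of the two ideas mentioned above, with assumptions: the router picks top-k experts per token and only the union of selected experts needs its weights fetched; the 4-bit layout (two nibbles per byte, symmetric zero point of 8, per-tensor scale) is illustrative, not the actual format either model uses. NumPy stands in for the real GPU code.

```python
import numpy as np

def active_experts(router_logits, top_k):
    # Union of top-k experts chosen by the router across all tokens in a
    # batch: only these experts' weights need to be moved to the GPU.
    idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    return np.unique(idx)

def unpack_4bit(packed, scale):
    # Assumed layout: each uint8 holds two 4-bit values, low nibble first,
    # symmetric quantization with zero point 8. Real formats differ.
    low = packed & 0x0F
    high = packed >> 4
    nibbles = np.stack([low, high], axis=-1).reshape(packed.shape[:-1] + (-1,))
    return (nibbles.astype(np.float32) - 8.0) * scale

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 512))      # 64 tokens, 512 experts
experts = active_experts(logits, top_k=10)   # usually a few hundred of 512
weights = unpack_4bit(np.array([0x21], dtype=np.uint8), scale=1.0)
```

With a small top-k and enough tokens sharing experts, the union stays well below 512, which is where the bandwidth savings come from.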