mendeza 3 days ago

What is the throughput for gpt-oss? 1 token every 2 seconds is really slow, though understandable since you are moving the cache to disk.

anuarsh 3 days ago | parent [-]

1 tok / 2 s is the best I got on my PC, thanks to the MoE architecture of qwen3-next-80B. gpt-oss-20B is slower because I load all experts of a single layer to the GPU and unpack the weights (4-bit -> bf16) each time, while with qwen3-next I load only the active experts (typically ~150 out of 512). I could probably do the same with gpt-oss.
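
A minimal sketch of the two ideas mentioned above, with assumptions: the router picks top-k experts per token and only the union of selected experts needs its weights fetched; the 4-bit layout (two nibbles per byte, symmetric zero point of 8, per-tensor scale) is illustrative, not the actual format either model uses. NumPy stands in for the real GPU code.

```python
import numpy as np

def active_experts(router_logits, top_k):
    # Union of top-k experts chosen by the router across all tokens in a
    # batch: only these experts' weights need to be moved to the GPU.
    idx = np.argpartition(router_logits, -top_k, axis=-1)[:, -top_k:]
    return np.unique(idx)

def unpack_4bit(packed, scale):
    # Assumed layout: each uint8 holds two 4-bit values, low nibble first,
    # symmetric quantization with zero point 8. Real formats differ.
    low = packed & 0x0F
    high = packed >> 4
    nibbles = np.stack([low, high], axis=-1).reshape(packed.shape[:-1] + (-1,))
    return (nibbles.astype(np.float32) - 8.0) * scale

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 512))      # 64 tokens, 512 experts
experts = active_experts(logits, top_k=10)   # usually a few hundred of 512
weights = unpack_4bit(np.array([0x21], dtype=np.uint8), scale=1.0)
```

With a small top-k and enough tokens sharing experts, the union stays well below 512, which is where the bandwidth savings come from.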