▲ | jsnell 5 days ago |

I linked to the writeup by DeepSeek with their actual numbers from production, and you want "better numbers" than that?!

> Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

That's a 5x difference, not 1000x. It also lines up with their pricing, as one would expect. (The decode throughputs they give are roughly equal to yours, but you're claiming prefill performance 200x higher than they can achieve.)
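A quick sanity check of the ~5x figure, using only the per-node throughputs quoted from DeepSeek's writeup:

```python
# Ratio of prefill to decode throughput from DeepSeek's published
# per-H800-node numbers, as quoted in the comment above.
prefill_tok_s = 73_700  # input tokens/s per node (including cache hits)
decode_tok_s = 14_800   # output tokens/s per node

ratio = prefill_tok_s / decode_tok_s
print(f"prefill/decode ratio: {ratio:.1f}x")  # ~5.0x, nowhere near 1000x
```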
▲ | smarterclayton 5 days ago | parent |

A good rule of thumb is that a prefill token costs about 1/6th the compute of a decode token, and that you can get about 15k prefill tokens a second on Llama3 8B on a single H100. Bigger models will require more compute per token, and quantization like FP8 or FP4 will require less.
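The rule of thumb above can be turned into a back-of-the-envelope estimate. This is a sketch under the stated assumptions (1/6 cost ratio, 15k prefill tokens/s on Llama3 8B per H100), not a measurement; real decode throughput is usually memory-bandwidth bound, so the compute-implied number is best read as a ceiling:

```python
# Back-of-the-envelope estimate from the rule of thumb in the comment above.
# Assumptions (not measurements): a prefill token takes ~1/6 the compute of
# a decode token, and one H100 does ~15k prefill tokens/s on Llama3 8B.
PREFILL_TO_DECODE_COST = 1 / 6

llama3_8b_prefill_tok_s = 15_000  # single H100, per the rule of thumb

# Decode throughput implied purely by compute cost. Real decode is typically
# memory-bound, so actual numbers will sit at or below this ceiling.
implied_decode_tok_s = llama3_8b_prefill_tok_s * PREFILL_TO_DECODE_COST
print(f"compute-implied decode ceiling: {implied_decode_tok_s:.0f} tokens/s")
```

The same arithmetic explains why prefill and decode are often priced and scheduled separately: a node saturating its compute on prefill moves several times more tokens than the same node spending its time decoding.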