martinald 5 days ago

Yes. I was really surprised at this myself (author here). If you have some better numbers I'm all ears. Even on my lowly 9070XT I get about 20x the tok/s on input vs output, and I'm not doing batching or anything locally.

I think the cache hit vs miss stuff makes sense at >100k tokens, where you start getting compute-bound.

jsnell 5 days ago | parent | next [-]

I linked to the writeup by DeepSeek with their actual numbers from production, and you want "better numbers" than that?!

> Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

That's a ~5x difference (73.7k / 14.8k ≈ 5), not 1000x. It also lines up with their pricing, as one would expect.

(The decode throughputs they give are roughly equal to yours, but you're claiming a prefill performance 200x higher than they can achieve.)

smarterclayton 5 days ago | parent [-]

A good rule of thumb is that a prefill token costs about 1/6th the compute of a decode token, and that you can get about 15k prefill tokens per second for Llama3 8B on a single H100. Bigger models require more compute per token, and quantization like FP8 or FP4 requires less.
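
As a rough sanity check on those numbers, here is a back-of-the-envelope roofline sketch, assuming H100 SXM specs (~989 TFLOPS dense BF16, ~3.35 TB/s HBM), ~2 FLOPs per parameter per token, and ignoring attention FLOPs and KV-cache traffic; the 25% MFU figure is an assumption, not a measurement:

    # Rough roofline estimate for Llama3 8B on one H100 SXM (assumed specs).
    PEAK_FLOPS = 989e12        # dense BF16 peak, FLOP/s
    MEM_BW = 3.35e12           # HBM bandwidth, bytes/s
    N_PARAMS = 8e9             # Llama3 8B
    BYTES_PER_PARAM = 2        # BF16 weights
    MFU = 0.25                 # assumed utilization during prefill

    # Prefill: compute-bound; each prompt token costs ~2*N_PARAMS FLOPs.
    prefill_tok_s = PEAK_FLOPS * MFU / (2 * N_PARAMS)     # ~15,500 tok/s

    # Decode at batch size 1: bandwidth-bound; each step re-reads all weights.
    decode_tok_s = MEM_BW / (N_PARAMS * BYTES_PER_PARAM)  # ~210 tok/s

    print(f"prefill ~{prefill_tok_s:,.0f} tok/s, decode@1 ~{decode_tok_s:,.0f} tok/s")

Under those assumptions the prefill estimate lands right around 15k tok/s, and the batch-1 decode figure shows why an unbatched setup sees such a large input/output gap.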

Filligree 5 days ago | parent | prev [-]

Maybe because you aren’t doing batching? It sounds like you’re assuming that would benefit prefill more than decode, but I believe it’s the other way around.
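
For what it's worth, under the same back-of-the-envelope model as in the sketch above (assumed H100-class specs, weights-only traffic, KV-cache reads ignored), batching amortizes the weight reads across concurrent sequences, so aggregate decode throughput grows roughly linearly with batch size until it hits the compute roof, whereas prefill is effectively already "batched" across the prompt tokens:

    # Decode throughput vs. batch size under the same rough roofline model.
    PEAK_FLOPS = 989e12
    MEM_BW = 3.35e12
    N_PARAMS = 8e9
    BYTES_PER_PARAM = 2

    compute_roof = PEAK_FLOPS / (2 * N_PARAMS)   # ~62k tok/s ceiling
    for batch in (1, 8, 64, 512):
        bw_bound = batch * MEM_BW / (N_PARAMS * BYTES_PER_PARAM)
        print(batch, round(min(bw_bound, compute_roof)))
    # 1 -> ~210, 8 -> ~1,700, 64 -> ~13,400, 512 -> ~62,000 (compute roof)

So an unbatched local run mostly shows how slow batch-1 decode is, not how fast prefill is relative to well-batched decode.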