martinald 5 days ago

Yes. I was really surprised by this myself (author here). If you have better numbers, I'm all ears. Even on my lowly 9070 XT I get ~20x the tok/s on input vs output, and I'm not doing batching or anything locally. I think the cache hit vs miss distinction makes sense at >100k tokens, where you start getting compute-bound.
jsnell 5 days ago | parent

I linked to the write-up by DeepSeek with their actual numbers from production, and you want "better numbers" than that?!

> Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

That's a 5x difference, not 1000x. It also lines up with their pricing, as one would expect. (The decode throughputs they give are roughly equal to yours, but you're claiming a prefill performance 200x higher than they can achieve.)
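For what it's worth, the ratio implied by those quoted per-node H800 figures is easy to check directly (a quick sanity-check sketch; the two throughput numbers are the ones from the DeepSeek write-up quoted above):

```python
# Per-node H800 throughput figures as quoted from the DeepSeek write-up.
prefill_tok_s = 73_700  # input tokens/s during prefilling (incl. cache hits)
decode_tok_s = 14_800   # output tokens/s during decoding

ratio = prefill_tok_s / decode_tok_s
print(f"prefill/decode ratio: {ratio:.1f}x")  # ~5.0x, nowhere near 1000x
```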
Filligree 5 days ago | parent
Maybe because you aren’t doing batching? It sounds like you’re assuming that would benefit prefill more than decode, but I believe it’s the other way around.
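A rough roofline-style model shows why batching tends to help decode much more than prefill: prefill processes a whole prompt in one pass, so it already reuses each weight across many tokens and sits near the compute roof, while decode must stream the full weight set to emit one token per sequence, so batching B sequences amortizes that weight traffic B-fold. The sketch below uses made-up illustrative constants (a ~7B fp16 model, ~1 TB/s memory bandwidth, ~100 TFLOP/s peak), not measured figures:

```python
# Toy roofline model of decode throughput vs. batch size.
# All constants are illustrative assumptions, not measurements.

WEIGHT_BYTES = 14e9      # assumed: ~7B params in fp16, streamed once per step
MEM_BW = 1e12            # assumed memory bandwidth, bytes/s (~1 TB/s HBM)
FLOPS_PER_TOKEN = 14e9   # ~2 * params FLOPs per generated token
PEAK_FLOPS = 100e12      # assumed peak compute, FLOP/s

def decode_tokens_per_s(batch: int) -> float:
    mem_time = WEIGHT_BYTES / MEM_BW                  # weight streaming per step
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS
    step_time = max(mem_time, compute_time)           # roofline: slower side wins
    return batch / step_time                          # one token per sequence per step

for b in (1, 8, 64, 256):
    print(f"batch {b:>3}: {decode_tokens_per_s(b):,.0f} tok/s")
```

Under these assumptions, decode throughput scales almost linearly with batch size until the compute roof is hit, because each step's cost is dominated by reading the weights regardless of how many sequences share that read. Prefill gets no comparable boost: a long prompt already packs thousands of tokens into each weight pass.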