binyu 4 hours ago
> Now at 40-50tok/s generation and ~2000 tok/s

Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the quant native model and adding a second Spark? Cheers
ttsiodras 2 hours ago
Just chiming in: the claims above are real. I see very similar numbers on a cluster of 2x GX10 that I have access to. Instructions to reproduce, and benchmarks, are here: https://forums.developer.nvidia.com/t/deepseek-v4-flash-offi...
wolttam 4 hours ago
I suspect DwarfStar could probably squeeze more performance out of the single Spark, maybe closer to 20 tok/s.

Moving to 2 Sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction (MTP). The parallelism and MTP, on top of better-tuned kernels[1], gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60 tok/s at ~150k context; sometimes the MTP seems to really kick in (i.e. a high acceptance rate on its tokens).

Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
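
For intuition on why a high MTP acceptance rate shows up as throughput bursts, here is a rough back-of-the-envelope sketch. It is not taken from vLLM or from the setup described above. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection, which is a simplification. The function names, the 20 tok/s base rate, and the `overhead` parameter are all hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption (not from the thread): each drafted token is accepted
    independently with probability accept_prob, and the first rejection
    discards the remaining drafts. Every pass emits at least one token.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    # Geometric sum: 1 + p + p^2 + ... + p^k
    return sum(accept_prob ** i for i in range(draft_tokens + 1))


def estimated_tok_per_s(base_tok_per_s: float, accept_prob: float,
                        draft_tokens: int, overhead: float = 0.0) -> float:
    """Scale a baseline decode rate by the expected tokens per pass.

    overhead is a hypothetical fractional cost of drafting and verifying
    (0.1 means each pass takes 10% longer than a plain decode step).
    """
    gain = expected_tokens_per_step(accept_prob, draft_tokens)
    return base_tok_per_s * gain / (1.0 + overhead)


if __name__ == "__main__":
    # Illustrative numbers only: a made-up 20 tok/s baseline, one draft token.
    for p in (0.3, 0.6, 0.9):
        print(p, estimated_tok_per_s(20.0, p, draft_tokens=1))
```

Under these assumptions, a single draft token can at best double the decode rate, and the gain tracks the acceptance rate directly, so text the draft head predicts well would decode noticeably faster than text it predicts poorly.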