wolttam 4 hours ago
I suspect DwarfStar could probably squeeze more performance out of a single Spark, maybe closer to 20 tok/s. Moving to two Sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction (MTP). The parallelism and MTP, on top of better-tuned kernels[1], gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60 tok/s at ~150k context; sometimes the MTP seems to really kick in (i.e. a high acceptance rate on its tokens). I'm currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
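For anyone wondering how an acceptance rate turns into tok/s, here is a rough back-of-envelope sketch. It is not how vLLM measures or schedules anything. It assumes every drafted token is accepted independently with the same probability, and that the first rejection discards the rest of the draft. The function names and the `overhead` parameter are made up for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verifier forward pass.

    Assumption: each drafted token is accepted independently with
    probability `acceptance_rate`, and the first rejection discards the
    remaining drafts. The verifier pass always yields one token itself,
    so the expectation is 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    return sum(acceptance_rate ** i for i in range(draft_tokens + 1))


def estimated_speedup(acceptance_rate: float, draft_tokens: int,
                      overhead: float = 0.0) -> float:
    """Rough speedup over plain one-token-per-pass decoding.

    `overhead` is a hypothetical knob: the extra cost of drafting and of
    verifying several positions at once, as a fraction of one normal
    forward pass (0.1 means each step costs 1.1 passes).
    """
    if overhead < 0.0:
        raise ValueError("overhead must be non-negative")
    return expected_tokens_per_step(acceptance_rate, draft_tokens) / (1.0 + overhead)


if __name__ == "__main__":
    for rate in (0.5, 0.8, 0.95):
        print(rate, round(estimated_speedup(rate, draft_tokens=2, overhead=0.1), 2))
```

Under these assumptions the gain depends heavily on the acceptance rate, which would be consistent with throughput arriving in bursts rather than as a steady multiplier.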
binyu 4 hours ago
Personally, I've tried to squeeze more tok/s out of a single DGX Spark deployment of DeepSeek V4 Flash but only got marginal improvements. There's work to do on fusing kernels and other optimizations, but those are already on antirez's roadmap, so it is not worth duplicating effort. I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?
doctorpangloss an hour ago
DeepSeek V4 Flash MTP is a training optimization. It doesn't make inference run faster; it must run the entire model forward as the "verifier." This is in the paper, and this is why the docs they release do not mention using it for accelerated inference. Eventually, I'm going to stop writing stuff like this @dang, because even though it is literally being read by a human, it's just going to be copied and pasted into a chatbot, which will actually spend the time trying to comprehend what I am saying.