onlyrealcuzzo 4 hours ago
It's useful at the local level, where there will be SOTA models developed...

zozbot234 3 hours ago
Local models are moving toward batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) appears to show about 2x aggregate decode throughput (tok/s) when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams. Note that prefill, which this change does not significantly help, then becomes a larger fraction of total wall-clock time, so the overall gain is lower than the decode-only figures suggest (i.e. prefill acts like the 'serial' portion in Amdahl's law). MTP will still be highly valuable for interactive use, of course.
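
To put rough numbers on the Amdahl's law point, here is a minimal sketch. The 2x and 3x decode speedups are the figures quoted above; the prefill fractions are purely illustrative assumptions, not measurements, and the function names are made up for this example.

```python
def overall_speedup(prefill_fraction: float, decode_speedup: float) -> float:
    """Amdahl's law with prefill as the part that is not accelerated.

    prefill_fraction: share of baseline wall-clock time spent in prefill
                      (an assumed input, not a measured value).
    decode_speedup:   aggregate decode speedup from batching (e.g. 2.0 or 3.0).
    """
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be in [0, 1]")
    if decode_speedup <= 0.0:
        raise ValueError("decode_speedup must be positive")
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


def prefill_share_after(prefill_fraction: float, decode_speedup: float) -> float:
    """Share of the new, shorter wall-clock time that prefill occupies."""
    decode_fraction = 1.0 - prefill_fraction
    new_total = prefill_fraction + decode_fraction / decode_speedup
    return prefill_fraction / new_total


if __name__ == "__main__":
    # Hypothetical prefill fractions, combined with the quoted decode speedups.
    for p in (0.0, 0.1, 0.2, 0.4):
        for s in (2.0, 3.0):
            print(
                f"prefill={p:.0%} decode={s:.0f}x -> "
                f"overall={overall_speedup(p, s):.2f}x, "
                f"prefill share after={prefill_share_after(p, s):.0%}"
            )
```

For example, if prefill were 20% of the baseline wall-clock time, a 2x decode speedup would work out to roughly 1.67x overall, and prefill would grow to about a third of the remaining time.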