| ▲ | kbumsik 3 hours ago |
| > performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
I heard that speculative decoding doesn't affect performance (by which I mean accuracy). Am I wrong about that? |
|
| ▲ | ketchup32613 3 hours ago | parent [-] |
| You're not wrong about that. Speculative decoding does not affect the quality of the generated tokens, because every token proposed by the draft model has to be verified by the parent (original) model before it is output. What can change is the acceptance rate: if the parent model rejects most of the draft's proposals, the speedup from speculative decoding is eliminated. This acceptance rate, and more directly the speedup from the draft model, is what "performance" refers to in the article. |
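To make the verification point concrete, here is a minimal sketch of the greedy case only. The "models" are toy functions invented for illustration (`target`, `good_draft`, `bad_draft` are not from any real library), and a real system verifies all drafted positions in one batched forward pass rather than a Python loop. With sampling instead of greedy decoding, implementations use a rejection-sampling rule that preserves the parent model's output distribution; that is not shown here.

```python
from typing import Callable, List, Tuple

# A toy "model": maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain decoding with the parent model only (the reference output)."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def speculative_step(target: Model, draft: Model,
                     prefix: List[int], k: int) -> Tuple[List[int], int]:
    """One draft-then-verify round. Returns (new tokens, drafts accepted)."""
    # 1. The draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The parent model checks each proposal in order.
    ctx = list(prefix)
    new_tokens: List[int] = []
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: emit the parent's token and drop the rest.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the parent adds one bonus token.
    new_tokens.append(target(ctx))
    return new_tokens, k


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], float]:
    """Speculative decoding loop. Returns (tokens, acceptance rate)."""
    out = list(prompt)
    accepted = drafted = 0
    while len(out) - len(prompt) < n_tokens:
        new_tokens, n_ok = speculative_step(target, draft, out, k)
        out.extend(new_tokens)
        accepted += n_ok
        drafted += k
    return out[:len(prompt) + n_tokens], accepted / drafted


# Hypothetical toy models, for illustration only.
def target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def good_draft(ctx: List[int]) -> int:
    # Agrees with the parent except at every fifth position.
    return target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 50


def bad_draft(ctx: List[int]) -> int:
    # Never agrees with the parent.
    return (target(ctx) + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy(target, prompt, 20)
    for name, draft in [("good", good_draft), ("bad", bad_draft)]:
        tokens, rate = generate(target, draft, prompt, 20)
        print(name, "same output:", tokens == reference, "acceptance:", rate)
```

Both drafts produce exactly the reference output; only the acceptance rate differs, and with it the number of parent-model rounds needed.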
| |
| ▲ | kbumsik 3 hours ago | parent [-] | | So the draft model's performance is directly linked to the overall speed. Thank you for the explanation! By the way, can it be slower than without speculative decoding in worst case then? | | |
| ▲ | daemonologist 2 hours ago | parent [-] | | > can it be slower than without speculative decoding in worst case then?
Yes - running the draft model costs compute and memory bandwidth, and running the drafted tokens through the main model costs compute. If the draft model is really inaccurate, or you're already compute-limited (usually because you're running large batches), you should expect some slowdown.
In practice, for single-user (non-batched) inference with a working configuration, you pretty much always get some speedup. For non-coding tasks I've seen it be nearly a wash for some people, in which case you might want to avoid it due to the extra memory usage (you'd rather use that memory to run a bigger quant/model, even at a slightly lower speed). |
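A rough back-of-the-envelope model of that trade-off, as a sketch under simplifying assumptions: each drafted token is accepted independently with probability `alpha`, a draft step costs `draft_cost` times a plain main-model step, and the verification pass costs `verify_cost` plain steps (about 1.0 when memory-bandwidth-bound, higher when compute-bound). The function and parameter names are made up for illustration; the `verify_cost` knob is my own addition to the usual expected-tokens formula.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify round.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the main model always contributes one token.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float,
                      verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (values below 1 are slowdowns).

    draft_cost:  time of one draft step relative to one plain main-model step.
    verify_cost: time of the verification pass relative to one plain step.
    """
    round_cost = k * draft_cost + verify_cost
    return expected_tokens_per_round(alpha, k) / round_cost


if __name__ == "__main__":
    # Accurate draft, cheap relative to the main model.
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.1))
    # Same draft cost, but the draft is almost always rejected.
    print(estimated_speedup(alpha=0.1, k=4, draft_cost=0.1))
    # Accurate draft, but verification is expensive (compute-bound batching).
    print(estimated_speedup(alpha=0.8, k=4, draft_cost=0.1, verify_cost=3.5))
```

With these illustrative numbers the first case comes out above 1 and the other two below 1, which matches the two slowdown scenarios described above: an inaccurate draft, or a main model that is already compute-limited.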
|
|