bhaney | 9 days ago
> How does the target model validate the draft tokens without running the inference as normal?

It does run the inference as normal, just in parallel with the other inferences.

> if it is doing just that, I don't get the point

Running inferences in parallel allows you to read the model weights out of memory only once for N parallel inferences, as opposed to reading them out of memory N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
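To make the "validate in parallel" part concrete, here is a rough sketch of greedy speculative decoding in Python. It assumes a HuggingFace-style causal LM interface (model(input_ids).logits); the names draft_model and target_model are placeholders, not anything from the thread, and it only shows greedy acceptance rather than the full rejection-sampling scheme.

  # Rough sketch, assuming HuggingFace-style models where model(ids).logits
  # returns [batch, seq_len, vocab] scores. Greedy acceptance only.
  import torch

  def speculative_step(target_model, draft_model, input_ids, k=4):
      # 1. The small draft model proposes k tokens, one at a time (cheap).
      draft_ids = input_ids
      for _ in range(k):
          logits = draft_model(draft_ids).logits[:, -1, :]
          next_tok = logits.argmax(dim=-1, keepdim=True)
          draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

      # 2. The target model scores the prompt plus all k draft tokens in ONE
      #    forward pass: its weights are read from memory once, and each
      #    position's logits say what the target would have generated there.
      target_preds = target_model(draft_ids).logits.argmax(dim=-1)

      # 3. Accept the longest prefix of draft tokens the target agrees with.
      n_prompt = input_ids.shape[1]
      accepted = input_ids
      for i in range(k):
          target_tok = target_preds[:, n_prompt - 1 + i]  # target's choice for position n_prompt + i
          draft_tok = draft_ids[:, n_prompt + i]
          if (target_tok == draft_tok).all():
              accepted = torch.cat([accepted, draft_tok.unsqueeze(-1)], dim=-1)
          else:
              # First mismatch: take the target's token instead and stop.
              accepted = torch.cat([accepted, target_tok.unsqueeze(-1)], dim=-1)
              break
      return accepted

The key step is 2: verifying k draft tokens costs one pass over the target model's weights instead of k, which is exactly the serial-vs-parallel memory-read saving described above.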
littlestymaar | 9 days ago
> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is, if you don't have many users calling the same model in parallel). Speculative decoding is just a way of running a single query as if it were parallel queries.
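For a rough sense of where that crossover sits, here's a back-of-envelope sketch; the 7B-parameter model and the hardware numbers are illustrative assumptions, not measurements from the thread.

  # Back-of-envelope, with made-up but plausible numbers: a 7B-parameter
  # model in fp16 on an accelerator with ~1000 GB/s of memory bandwidth
  # and ~300 TFLOP/s of fp16 compute.
  params = 7e9
  weight_bytes = params * 2          # ~14 GB of weights read per decode step
  mem_bw = 1000e9                    # bytes/s
  compute = 300e12                   # FLOP/s

  def step_time(batch_size):
      # Each decode step streams all weights once (shared across the batch)
      # and does roughly 2 FLOPs per parameter per sequence in the batch.
      t_mem = weight_bytes / mem_bw
      t_compute = 2 * params * batch_size / compute
      return max(t_mem, t_compute)

  for b in (1, 8, 64, 512):
      print(b, f"{step_time(b) * 1e3:.1f} ms/step")

On those assumed numbers, a decode step is dominated by the ~14 GB weight read until the batch size reaches the hundreds, which is why a single query looks bandwidth-bound while a well-batched server does not.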