bhaney 9 days ago

> How does the target model validate the draft tokens without running the inference as normal?

It does run the inference as normal, just in parallel with the other inferences: one forward pass of the target model scores every draft token at once, instead of one pass per token.
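
For the greedy case, the verification step looks roughly like this sketch (`target_forward` is a hypothetical stand-in for one forward pass of the target model, returning its next-token prediction at every input position):

    # Rough sketch of greedy verification; `target_forward` is a hypothetical
    # stand-in for one forward pass of the target model, returning its
    # next-token prediction at every position of the input.
    def verify_draft(prefix: list[int], draft: list[int], target_forward) -> list[int]:
        # One forward pass covers the prefix plus every draft token at once.
        preds = target_forward(prefix + draft)  # one predicted token id per position
        accepted = []
        for i, drafted in enumerate(draft):
            # preds[j] is what the target would emit after seeing tokens 0..j,
            # so the prediction "for" draft[i] lives at index len(prefix)+i-1.
            expected = preds[len(prefix) + i - 1]
            if drafted != expected:
                accepted.append(expected)  # first mismatch: take the target's token, stop
                break
            accepted.append(drafted)
        else:
            accepted.append(preds[-1])  # all drafts accepted: one bonus token for free
        return accepted

If any draft token disagrees with what the target would have produced, everything from that point on is thrown away, so the output is identical to what plain decoding would have given.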

> if it is doing just that, I don't get the point

Running inferences in parallel lets you read the model weights out of memory once for N parallel inferences, as opposed to reading them N times for N serial inferences. Inference is massively bottlenecked by memory bandwidth, to the tune of one or two orders of magnitude compared to compute, so this helps a lot.
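
Rough numbers to illustrate (all assumed: a 70B fp16 model on an accelerator with ~2 TB/s of memory bandwidth):

    # Back-of-the-envelope numbers (assumptions, not measurements):
    params = 70e9              # 70B-parameter model
    weight_bytes = params * 2  # fp16/bf16: 140 GB of weights per forward pass
    bandwidth = 2e12           # ~2 TB/s of HBM bandwidth

    t_step = weight_bytes / bandwidth  # ~70 ms just to stream the weights once
    n = 8                              # draft tokens verified in parallel
    print(f"serial: {t_step * 1e3:.0f} ms/token")
    print(f"batch of {n}: {t_step / n * 1e3:.1f} ms/token")  # same weight read, shared N ways

The activations for a few extra positions are tiny next to the weights, so verifying N draft tokens costs about the same wall-clock time as generating one.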

littlestymaar 9 days ago

> Inference is massively bottlenecked by memory bandwidth to the tune of one or two orders of magnitude compared to compute, so this helps a lot.

Nitpick: it's only bottlenecked by memory bandwidth if the batch size is too low (that is: if you don't have many users calling the same model in parallel).

Speculative decoding is just a way of running a single query as if it were parallel queries.
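
You can estimate where that crossover sits with some assumed round numbers (70B fp16 model, ~1 PFLOP/s of compute, ~2 TB/s of bandwidth); below the crossover batch size, each decode step is dominated by streaming the weights:

    # Assumed round numbers for the crossover estimate:
    flops_per_token = 2 * 70e9  # ~2 FLOPs per parameter per generated token
    weight_bytes = 70e9 * 2     # fp16 weights, streamed once per decode step
    peak_flops = 1e15           # ~1 PFLOP/s of compute
    bandwidth = 2e12            # ~2 TB/s of memory bandwidth

    for batch in (1, 8, 64, 512):
        t_compute = batch * flops_per_token / peak_flops
        t_memory = weight_bytes / bandwidth  # one weight read, shared by the whole batch
        bound = "memory" if t_memory > t_compute else "compute"
        print(f"batch {batch:>3}: compute {t_compute * 1e3:6.2f} ms, "
              f"memory {t_memory * 1e3:6.2f} ms -> {bound}-bound")

With these numbers the step only becomes compute-bound around batch ~500, which is why a single interactive query (batch 1) leaves so much compute idle for speculation to exploit.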