awestroke 6 hours ago

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?

radarsat1 3 hours ago | parent | next [-]

Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.
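To sketch why no sampling is needed: when all the tokens are already known, the model only has to *score* them, and every position can be scored at once (teacher forcing). Below, a made-up bigram probability table stands in for the transformer's forward pass; the names and numbers are purely illustrative.

```python
import math

# Toy "language model": maps a context token to a next-token probability
# distribution. This is a stand-in for a transformer forward pass; the
# point is that scoring known tokens requires no sampling.
PROBS = {  # hypothetical bigram probabilities
    "a": {"a": 0.1, "b": 0.7, "c": 0.2},
    "b": {"a": 0.3, "b": 0.1, "c": 0.6},
    "c": {"a": 0.5, "b": 0.4, "c": 0.1},
}

def score_sequence(tokens):
    """Return the log-probability of each token given its predecessor.

    In a real transformer this is ONE parallel forward pass over the
    whole sequence, not an autoregressive generation loop."""
    return [math.log(PROBS[prev][tok])
            for prev, tok in zip(tokens, tokens[1:])]

logps = score_sequence(["a", "b", "c"])
# logps now holds log P("b" | "a") and log P("c" | "b"): how "likely"
# the base model finds each pregenerated token, obtained without sampling.
```

Generation, by contrast, has to sample token i before it can even build the input for token i+1, which is why it can't be parallelized the same way.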

qeternity 5 hours ago | parent | prev | next [-]

I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens: all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever), but as long as the 5 forward passes of the draft model + 1 prefill pass of the target model are faster than 4 autoregressive forward passes of the target, you will have a speedup while maintaining the exact output distribution of the target.
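A minimal sketch of the accept step. This uses the simpler greedy-verification variant (accept draft tokens while they match the target's own greedy choice); the distribution-preserving version described above replaces the equality check with rejection sampling against the two models' probability distributions, but the accept-count logic has the same shape. All names here are illustrative.

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (sketch).

    `target_argmax` has len(draft_tokens) + 1 entries: the target
    model's greedy choice at every draft position, plus one "bonus"
    position after the last draft token -- all produced by a single
    parallel forward pass over the drafted sequence.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)  # target's correction replaces the bad draft
            return accepted     # everything after the mismatch is discarded
        accepted.append(d)
    # All drafts matched: the same pass also yields one free extra token.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# 5 drafted tokens, target disagrees at position 3: 3 drafts accepted
# plus the target's correction, so 4 tokens of progress from one pass.
out = verify_draft([5, 7, 9, 2, 4], [5, 7, 9, 3, 1, 8])
```

Either way, each round makes at least one token of progress (the correction or bonus token comes from the target itself), which is why the worst case degrades to roughly normal target-model speed rather than below it.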

a1j9o94 5 hours ago | parent | prev | next [-]

You would only use the base model during training. This is a distillation technique.

Balinares 4 hours ago | parent | prev | next [-]

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

anentropic 5 hours ago | parent | prev [-]

presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model