awestroke 6 hours ago

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?

radarsat1 3 hours ago | parent | next [-]

Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.
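To sketch why no sampling is needed: when all the tokens are already known, the model only has to *score* them, and every position can be scored at once (teacher forcing). Below, a made-up bigram probability table stands in for the transformer's forward pass; the names and numbers are purely illustrative.

```python
import math

# Toy "language model": maps a context token to a next-token probability
# distribution. This is a stand-in for a transformer forward pass; the
# point is that scoring known tokens requires no sampling.
PROBS = {  # hypothetical bigram probabilities
    "a": {"a": 0.1, "b": 0.7, "c": 0.2},
    "b": {"a": 0.3, "b": 0.1, "c": 0.6},
    "c": {"a": 0.5, "b": 0.4, "c": 0.1},
}

def score_sequence(tokens):
    """Return the log-probability of each token given its predecessor.

    In a real transformer this is ONE parallel forward pass over the
    whole sequence, not an autoregressive generation loop."""
    return [math.log(PROBS[prev][tok])
            for prev, tok in zip(tokens, tokens[1:])]

logps = score_sequence(["a", "b", "c"])
# logps now holds log P("b" | "a") and log P("c" | "b"): how "likely"
# the base model finds each pregenerated token, obtained without sampling.
```

Generation, by contrast, has to sample token i before it can even build the input for token i+1, which is why it can't be parallelized the same way.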

qeternity 5 hours ago | parent | prev | next [-]

I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens: all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever), but as long as the 5 forward passes of the draft model + 1 prefill pass of the target model are faster than 4 autoregressive forward passes of the target, you will have a speedup while maintaining the exact output distribution of the target.
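A minimal sketch of the accept step. This uses the simpler greedy-verification variant (accept draft tokens while they match the target's own greedy choice); the distribution-preserving version described above replaces the equality check with rejection sampling against the two models' probability distributions, but the accept-count logic has the same shape. All names here are illustrative.

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (sketch).

    `target_argmax` has len(draft_tokens) + 1 entries: the target
    model's greedy choice at every draft position, plus one "bonus"
    position after the last draft token -- all produced by a single
    parallel forward pass over the drafted sequence.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d != t:
            accepted.append(t)  # target's correction replaces the bad draft
            return accepted     # everything after the mismatch is discarded
        accepted.append(d)
    # All drafts matched: the same pass also yields one free extra token.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted

# 5 drafted tokens, target disagrees at position 3: 3 drafts accepted
# plus the target's correction, so 4 tokens of progress from one pass.
out = verify_draft([5, 7, 9, 2, 4], [5, 7, 9, 3, 1, 8])
```

Either way, each round makes at least one token of progress (the correction or bonus token comes from the target itself), which is why the worst case degrades to roughly normal target-model speed rather than below it.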

a1j9o94 5 hours ago | parent | prev | next [-]

You would only use the base model during training. This is a distillation technique.

Balinares 4 hours ago | parent | prev | next [-]

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

anentropic 5 hours ago | parent | prev [-]

presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model