btown 4 hours ago

For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?

Zetaphor 3 hours ago | parent | next

My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights to try to accurately predict its token output. This wouldn't work cross-model, as the token probabilities are completely different.

jasonjmcghee 2 hours ago | parent | next

This is not correct.

Families of model sizes work great for speculative decoding. Use the 1B with the 32B or whatever.

It's a balance: you want the draft to be guessing correctly as often as possible, but also to be as fast as possible. Validation takes time, and every guess needs to be validated, etc.

The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
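
Rough sketch of the loop, for the curious (draft_model/target_model and their methods are illustrative names, not any real library's API):

    # Greedy speculative decoding sketch: the small draft model proposes
    # k tokens cheaply; the big target model checks all of them in one
    # forward pass instead of k sequential passes.
    def speculative_step(target_model, draft_model, context, k=4):
        # 1. Draft proposes k tokens one at a time (fast).
        ctx = list(context)
        proposed = []
        for _ in range(k):
            tok = draft_model.argmax_next(ctx)  # hypothetical method
            proposed.append(tok)
            ctx.append(tok)

        # 2. Target scores every position in one batched pass (slow but
        #    amortized): target_preds[i] is the token the target would
        #    emit after context + proposed[:i].
        target_preds = target_model.argmax_after(context, proposed)  # hypothetical

        # 3. Keep the longest matching prefix; on the first miss, take
        #    the target's token, so output equals what the target alone
        #    would have produced.
        out = []
        for guess, truth in zip(proposed, target_preds):
            out.append(truth)
            if guess != truth:
                break
        return out

If the draft matches on all k guesses you get k tokens for one target pass; if it misses immediately you still get one correct token. Output quality never changes, only the speedup does.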

ashirviskas 3 hours ago | parent | prev

Smaller quant or smaller model?

AFAIK it can work with anything, but sharing a vocab solves a lot of headaches, and the better the token probs match, the more efficient it gets.

Which is why it is usually done with same-family models, and most often NOT just different quantizations of the same model.
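
To make the probability point concrete: the standard acceptance rule from the speculative sampling papers keeps a drafted token t with probability min(1, p_target(t) / p_draft(t)), and that comparison only makes sense if both models assign probabilities over the same token ids, i.e. a shared vocab. A minimal sketch (p_target/p_draft are illustrative per-token probability lookups):

    import random

    def accept_drafted_token(token_id, p_target, p_draft):
        # Keep the draft's token with probability min(1, p_t / p_d).
        # p_d > 0 in practice, since the draft actually sampled it.
        p_t = p_target[token_id]  # target model's prob for this token
        p_d = p_draft[token_id]   # draft model's prob for this token
        return random.random() < min(1.0, p_t / p_d)

The closer the draft's distribution tracks the target's, the higher the acceptance rate, which is why same-family pairs work so much better than unrelated models.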

vessenes 3 hours ago | parent | prev

I think they’d commission a quant directly. Benefits go down a lot when you leave model families.