jasonjmcghee 3 hours ago
This is not correct. Families of model sizes work great for speculative decoding: use the 1B with the 32B or whatever. It's a balance, since you want the draft model to guess correctly as often as possible while also being as fast as possible. Validation takes time, and every guess needs to be validated, etc. The model you're using to speculate could be anything, but if it's not guessing what the main model would predict, it's useless.
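To make the trade-off concrete, here is a rough toy sketch of greedy speculative decoding. Everything in it is an assumption for illustration: the "models" are hypothetical plain Python functions that map a token prefix to the next token, not real 1B/32B LLMs, and verification is done in a loop where a real system would use one batched forward pass of the main model. It only shows the accept/reject logic: the output always matches what the main model would produce on its own, and the draft only helps when its guesses agree.

```python
from typing import Callable, List, Tuple

# Hypothetical model type: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int, int]:
    """Return (tokens, accepted_guesses, proposed_guesses)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    accepted = proposed = 0
    while len(tokens) < goal:
        # Draft phase: the small model cheaply proposes k tokens.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(tokens + guesses))
        proposed += len(guesses)
        # Verify phase: keep guesses only while they match the main model.
        for guess in guesses:
            expected = target(tokens)
            if guess != expected:
                tokens.append(expected)  # first mismatch: use the main model's token
                break
            tokens.append(guess)
            accepted += 1
            if len(tokens) >= goal:
                break
        else:
            # Every guess was accepted, so the main model adds one more token.
            tokens.append(target(tokens))
    return tokens[:goal], accepted, proposed


# Toy stand-ins (assumptions, not real models).
def main_model(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def good_draft(prefix: List[int]) -> int:
    # Agrees with main_model except after token 7.
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10


def bad_draft(prefix: List[int]) -> int:
    # Never agrees with main_model.
    return (prefix[-1] + 2) % 10


if __name__ == "__main__":
    for name, draft in [("good", good_draft), ("bad", bad_draft)]:
        out, acc, prop = speculative_decode(main_model, draft, [0], 20)
        print(name, "accepted", acc, "of", prop, "guesses")
```

With the draft that never matches, every round still pays for the drafting and the validation but only gains the one token the main model supplies itself, which is the "useless" case.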