imtringued 9 days ago
You're forgetting that some sequences are more predictable than others, hence the name "speculative" decoding. Let's say your tokenizer has a vocabulary of 128k tokens. That means the model has to pick the right token out of 128k at every step. Some of those tokens are incredibly rare, while others are super common. The big model has typically seen the rare tokens many more times than the small model (and has more capacity to remember them). This means the small model will be able to do things like produce grammatically correct English, but it won't know anything about a specific JS framework, so its guesses get accepted on the easy stretches and rejected on the hard ones. The post-training fine-tuning cost (in the low thousands of dollars) is the main reason why speculative decoding is relatively unpopular. The most effective speculative decoding strategy requires you to train multiple prediction heads à la Medusa (or whatever succeeded it). If you don't do any fine-tuning, the probability of the small model being useful is slim. Using a random model as your draft model will probably give you very disappointing results.
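To make the "predictable vs. rare" point concrete, here is a rough Python sketch of one way to picture it. It assumes the simplest greedy variant (a drafted token is kept only if it exactly matches what the big model would have picked), not the probabilistic accept/reject rule used with sampling, and not Medusa-style heads. Everything in it (`speculative_step`, `generate`, the toy "models", the example sentence) is made up for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical toy interface: a "model" maps a context to its greedy next token.
Model = Callable[[List[str]], str]

def speculative_step(target: Model, draft: Model,
                     context: List[str], k: int) -> Tuple[List[str], int]:
    """Draft proposes k tokens; target keeps the longest agreeing prefix.

    Returns (new tokens, number of drafted tokens that were accepted).
    """
    proposed: List[str] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    new_tokens: List[str] = []
    ctx = list(context)
    for tok in proposed:
        # A real system would score all k positions in one batched forward
        # pass of the big model; the toy just calls it position by position.
        expected = target(ctx)
        if tok != expected:
            new_tokens.append(expected)  # correction replaces the first wrong guess
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)
    new_tokens.append(target(ctx))  # bonus token when every guess matched
    return new_tokens, k

def generate(target: Model, draft: Model, k: int,
             n_tokens: int) -> Tuple[List[str], int, int]:
    """Returns (output tokens, verification passes, accepted draft tokens)."""
    out: List[str] = []
    passes = 0
    accepted = 0
    while len(out) < n_tokens:
        new, acc = speculative_step(target, draft, out, k)
        out.extend(new)
        passes += 1
        accepted += acc
    return out[:n_tokens], passes, accepted

# Toy "models": the target knows the whole sentence, the tuned draft knows
# the common English words but not the framework-specific ones.
TEXT = "the cat sat on the mat and then useState triggered a rerender".split()
RARE = {"useState", "rerender"}
DRAFT_TEXT = [t if t not in RARE else "<unk>" for t in TEXT]

def target_model(ctx: List[str]) -> str:
    return TEXT[len(ctx)] if len(ctx) < len(TEXT) else "<eos>"

def tuned_draft(ctx: List[str]) -> str:
    return DRAFT_TEXT[len(ctx)] if len(ctx) < len(DRAFT_TEXT) else "<eos>"

def random_draft(ctx: List[str]) -> str:
    return "<unk>"  # stand-in for a draft model that never agrees with the target

if __name__ == "__main__":
    for name, d in [("tuned", tuned_draft), ("random", random_draft)]:
        out, passes, acc = generate(target_model, d, k=4, n_tokens=len(TEXT))
        print(name, "passes:", passes, "accepted:", acc)
```

Both drafts end up with exactly the same output, since the big model has the final say either way. The difference is how many big-model passes it takes: the draft that matches on the common tokens needs far fewer, while the one that never agrees falls back to one token per pass, which is plain decoding plus the wasted draft work.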