ahmedfromtunis | 9 days ago
But what would happen if the small model's prediction was "is Rome."? Wouldn't that result in costlier inference if the small model is "wrong" more often than it is correct? Also, if the small model were sufficiently more often "correct" than "wrong", wouldn't it be more efficient to drop the large model entirely at that point?
imtringued | 9 days ago
You're forgetting that some sequences are more predictable than others, hence the name "speculative" decoding. Say your tokenizer has a 128k-token vocabulary: the model has to pick the right token out of 128k. Some of those tokens are incredibly rare while others are super common, and the big model has seen the rare ones many more times than the small model has. This means the small model will be able to do things like produce grammatically correct English, but won't know anything about a specific JS framework.

The post-training fine-tuning cost (low thousands of dollars) is the main reason speculative decoding is relatively unpopular. The most effective speculative decoding strategies require you to train multiple prediction heads à la Medusa (or whatever succeeded it). If you don't do any fine-tuning, the probability of the small model being useful is slim; using a random model as your draft model will probably give you very disappointing results.
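For intuition, here's a minimal sketch of the greedy draft/verify loop. The model callables and the per-token loop are stand-ins invented for illustration; a real implementation scores all draft positions in a single batched forward pass of the target model:

    def speculative_decode_step(draft_model, target_model, prefix, k=4):
        """One round of greedy speculative decoding (illustrative sketch).

        draft_model / target_model: hypothetical callables mapping a
        token list to the next greedy token; not a real library API.
        """
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        ctx = list(prefix)
        for _ in range(k):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. The expensive target model checks each draft position
        #    (emulated here with a loop) and we keep the matching prefix.
        accepted = []
        ctx = list(prefix)
        for tok in draft:
            expected = target_model(ctx)
            if expected != tok:
                # First mismatch: take the target's token instead and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # All k drafts matched; the same verify pass yields a bonus token.
            accepted.append(target_model(ctx))
        return accepted

Note that even in the worst case, where the very first draft token is rejected, the target's verify pass still emits one token, so you never fall below plain autoregressive decoding; the cost of wrong drafts is only the draft model's (cheap) forward passes.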
acters | 9 days ago
I believe that is exactly the downside of speculative decoding, which is why it is very important to size the two models properly relative to each other: the small one has to be big enough to be right most of the time while still being dramatically faster, and the large one has to be fast enough that catching the small one's mistakes doesn't introduce too many random delays. Also, if the small one is wrong, having the large one correct the mistake is miles better than leaving incorrect output in place. It's about keeping quality while getting the faster speed most of the time.

The tradeoff is that you consume more memory holding two models instead of one. If memory is the constraint, it can make more sense to just run the smaller model by itself.
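As a rough back-of-envelope for that sizing tradeoff (the numbers here are invented, and the formula assumes each draft token is accepted independently with probability alpha, following the standard analysis of greedy speculative decoding):

    def expected_speedup(alpha, k, draft_cost):
        """Rough speedup estimate for one speculative decoding round.

        alpha: probability a draft token matches the target (assumed i.i.d.)
        k: draft tokens proposed per round
        draft_cost: draft model cost as a fraction of a target forward pass
        """
        # Expected tokens emitted per target pass (accepted prefix + 1 bonus).
        tokens_per_round = (1 - alpha ** (k + 1)) / (1 - alpha)
        # Each round costs one target pass plus k draft passes.
        round_cost = 1 + k * draft_cost
        return tokens_per_round / round_cost

    print(expected_speedup(alpha=0.8, k=4, draft_cost=0.05))  # ~2.8x

If alpha drops too low (the small model is "wrong more than it is correct"), the estimate falls toward 1x and the bookkeeping overhead can make it a net loss, which is the scenario the top comment is asking about.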
| ||||||||
cwyers | 9 days ago
So, the way speculative decoding works, the large model only takes over at the first wrong token, so you still get 'is' for free.
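Concretely, with the "is Rome." example from the top of the thread (toy lists standing in for each model's greedy pick at every position, and "Paris" is an invented stand-in for whatever the big model would actually say):

    draft  = ["is", "Rome", "."]   # what the small model proposed
    target = ["is", "Paris", "."]  # the big model's pick at each position,
                                   # conditioned on the draft prefix so far

    # Accept the matching prefix, then take the target's own token at
    # the first mismatch; everything after the mismatch is discarded,
    # since it was conditioned on a rejected context.
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            accepted.append(t)  # correction comes from the same verify pass
            break
        accepted.append(d)

    print(accepted)  # ['is', 'Paris'] -- 'is' was free, 'Rome' and '.' dropped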