roadside_picnic 7 days ago
I'm sure there are countless tricks, but one that can be implemented at home, and that I know plays a major part in Cerebras' performance, is speculative decoding. Speculative decoding uses a smaller draft model to generate candidate tokens with much less compute and memory, and the main model then accepts those tokens based on the probability it would have generated them itself. In practice this can easily result in a 3x speedup in inference (rough sketch below).

Another trick, for structured outputs, is "fast forwarding": you can skip the model entirely for tokens that are the only acceptable output. For example, when generating JSON you know the output has to start with `{ "<first key>": ` etc. This can also lead to a ~3x speedup when responding in JSON.
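A minimal sketch of the greedy variant of speculative decoding, assuming HuggingFace-style causal LMs that share a tokenizer. The full algorithm accepts stochastically with probability min(1, p_target/p_draft); this simplified version just checks agreement, and none of it is Cerebras' actual implementation:

    import torch

    def speculative_step(draft, target, ids, k=4):
        # Draft model proposes k tokens cheaply (greedy for simplicity).
        proposed = []
        draft_ids = ids
        for _ in range(k):
            logits = draft(draft_ids).logits[:, -1, :]
            tok = logits.argmax(dim=-1, keepdim=True)   # shape [1, 1]
            proposed.append(tok)
            draft_ids = torch.cat([draft_ids, tok], dim=-1)

        # Target model scores all k proposals in ONE forward pass;
        # verifying k tokens for the cost of one step is the whole trick.
        logits = target(draft_ids).logits

        # Accept proposals while the target agrees; at the first
        # disagreement, substitute the target's own token and stop.
        out = ids
        for i, tok in enumerate(proposed):
            pos = ids.shape[1] - 1 + i  # position that predicts token i
            target_tok = logits[:, pos, :].argmax(dim=-1, keepdim=True)
            out = torch.cat([out, target_tok], dim=-1)
            if not torch.equal(target_tok, tok):
                break  # rest of the draft is discarded
        return out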
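And a hedged sketch of fast forwarding under constrained JSON decoding; `walker` and `model.next_token` are hypothetical stand-ins for a schema tracker and a decode step, not any real library's API:

    def generate_json(model, tokenizer, ids, walker):
        # `walker` tracks the position in the JSON schema and reports
        # spans whose content is fully determined (hypothetical helper).
        while not walker.done():
            forced = walker.forced_text()  # e.g. '{"name": ', or '' if free
            if forced:
                # Deterministic span: append its tokens with zero
                # forward passes -- this is the "fast forward".
                ids.extend(tokenizer.encode(forced))
                walker.advance(forced)
            else:
                tok = model.next_token(ids)  # one ordinary decode step
                ids.append(tok)
                walker.advance(tokenizer.decode([tok]))
        return ids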
tough 7 days ago | parent
gpt-oss-120b can be used with gpt-oss-20b as a speculative draft model in LM Studio; I'm not sure it improved the speed much.