▲ | tough 7 days ago
gpt-oss-120b can be used with gpt-oss-20b as a speculative draft model in LM Studio. I'm not sure it improved the speed much.
▲ | roadside_picnic 7 days ago | parent | next [-]
To measure the performance gains on a local machine (or even a standard cloud GPU setup), you need to compare the number of calls made to each model, since you can't run verification with the same parallel efficiency you'd get in a high-end data center. In my experience, calls to the target model dropped to about a third of what they would have been without a draft model. You'll still get some gains on a local setup, but they won't be near the theoretical maximum even with everything properly tuned for performance. It also depends on the type of task: I was working with pretty structured data with lots of easy-to-predict tokens.
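To make the "calls reduced to a third" point concrete, here's a back-of-the-envelope sketch. The numbers (900 tokens, ~3 tokens accepted per verification call) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope estimate of speculative decoding gains.
# Assumed numbers are illustrative, not measurements.

def target_calls(total_tokens, tokens_per_verify):
    """Target-model forward passes needed when each verification
    call accepts `tokens_per_verify` tokens on average."""
    return total_tokens / tokens_per_verify

baseline = target_calls(900, 1)    # no draft model: one target call per token
with_draft = target_calls(900, 3)  # draft gets ~3 tokens accepted per verify
print(baseline / with_draft)       # -> 3.0, i.e. calls reduced to a third
```

The ratio of target-model calls is the number to watch locally, since wall-clock speedup depends on how cheaply you can run the draft and verification passes.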
▲ | qcnguy 6 days ago | parent | prev | next [-]
It depends a lot on the type of conversation. A lot of ChatGPT load appears to be therapy talk that even small models can correctly predict.
▲ | vrm 7 days ago | parent | prev [-]
A 6:1 parameter ratio is too small for speculative decoding to have much of an effect. You'd really want 10:1 or more before it starts to matter.
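A toy cost model shows why the ratio matters. All the numbers below (draft length of 4, ~3 tokens accepted per verify) are hypothetical assumptions picked for illustration; `c` is the draft model's per-token cost relative to the target, which roughly tracks the parameter ratio:

```python
# Toy cost model for speculative decoding. All parameters are
# illustrative assumptions, not benchmarks.

def relative_cost(c, k, a):
    """Cost per accepted token relative to running the target alone.
    c: draft-model cost per token relative to the target
    k: tokens drafted per verification call
    a: average tokens accepted per verification call
    Each round costs k draft passes plus 1 target verify, and
    yields `a` accepted tokens."""
    return (c * k + 1) / a

# 6:1 parameter ratio -> draft costs roughly 1/6 of the target
print(relative_cost(1 / 6, 4, 3))   # ~0.56: a modest gain

# 10:1 ratio -> the draft is closer to free
print(relative_cost(1 / 10, 4, 3))  # ~0.47
```

Values below 1.0 mean speculation pays off; the gap between the 6:1 and 10:1 cases is why a larger size ratio helps, especially on local hardware where draft and target passes can't overlap efficiently.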