porridgeraisin 10 hours ago

There's an obvious baseline that seems to be missing.

If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decoding settings used to generate the distillation's ground truth, does it match the SSD'd model plus some decoding setting, performance-wise?

Their sweep is missing this; it only covers "standard" decoding settings.
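
For concreteness, here is a rough sketch of what sampling with those settings means, in plain Python. The function names `filter_probs` and `sample` are made up for illustration, and the filter order (temperature, then top-k, then top-p over the renormalized top-k set) is an assumption; real inference libraries may differ in the details.

```python
import math
import random


def filter_probs(logits, temperature=1.6, top_k=20, top_p=0.8):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering.

    Hypothetical helper: the order of the filters is an assumption.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely token ids, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng=random, **settings):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **settings)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

The baseline I'm asking about would amount to running the base model's eval with something like `sample(logits, temperature=1.6, top_k=20, top_p=0.8)` at every step and comparing that against the SSD'd model.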