porridgeraisin 10 hours ago
There's an obvious baseline that seems to be missing. If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decode settings used to generate the distillation's ground truth, does it match the performance of the SSD'd model plus some decoding? Their sweep doesn't include this; it only covers "standard" decoding settings.
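
For concreteness, here is a rough sketch of what decoding with those three settings does to a single next-token distribution. It is plain Python, not tied to any model or library, and the helper names `filter_probs` and `sample` are made up for illustration. The order (temperature, then top-k, then top-p) is an assumption about how the settings compose; the paper's actual pipeline may differ.

```python
import math
import random

def filter_probs(logits, temperature=1.6, top_k=20, top_p=0.8):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering.

    Hypothetical helper; the filter order is an assumption.
    """
    # Temperature: divide logits before the softmax (T > 1 flattens the distribution).
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = scaled[order[0]]
    weights = [math.exp(scaled[i] - m) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize what is left so it sums to 1.
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}

def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

The baseline in question would amount to running the base model with something like `sample(logits)` at every step and comparing its scores against the SSD'd model under the settings in their sweep.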