Remix clone Hacker News

	▲	porridgeraisin 10 hours ago
		There's an obvious baseline that seems to be missing. If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decode settings used for the distillation's ground truth, does it match the SSD'd model plus some decoding, performance-wise? Their sweep is missing this and only covers "standard" decoding settings.
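
For anyone unsure what those three knobs do to the next-token distribution, here is a minimal stdlib-only sketch. It assumes the common ordering (temperature, then top-k, then top-p on the renormalized top-k mass); the function names are made up for illustration and this is not taken from the paper's code.

```python
import math
import random


def truncated_distribution(logits, temperature=1.6, top_k=20, top_p=0.8):
    """Return {token_id: prob} after temperature scaling, top-k, then top-p.

    Assumption: filters are applied in this order; other stacks may differ.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable token ids (all of them if vocab < k).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    k_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, rng=random, **settings):
    """Draw one token id from the truncated distribution."""
    dist = truncated_distribution(logits, **settings)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

The baseline in question would be the base model evaluated with the defaults above (T=1.6, top_k=20, top_p=0.8) rather than with the usual low-temperature settings.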