Remix clone Hacker News

	nbardy 3 hours ago
		You can estimate it from tok/second. The trillions-of-parameters claim is about pretraining. In pretraining it's most efficient to train the biggest model you can, because sample efficiency improves with every increase in parameter count. However, those models end up very sparse and highly distillable. And it's far too expensive and slow to serve models that size, so they get distilled down a lot.
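
		One rough way to do the tok/second estimate, as a back-of-envelope sketch rather than anything the comment spells out: assume single-stream decoding is memory-bandwidth bound, so each generated token requires reading every active weight once. The function name, the bandwidth figure, and the 2-bytes-per-parameter default below are all assumptions for illustration. Batching, speculative decoding, and KV-cache traffic are ignored, so treat the result as an order-of-magnitude guess.

```python
def estimate_active_params(tokens_per_second: float,
                           mem_bandwidth_bytes_per_s: float,
                           bytes_per_param: float = 2.0) -> float:
    """Rough estimate of parameters read per decoded token.

    Assumption (not from the original comment): decoding is limited by
    memory bandwidth, so
        tokens/s ~= bandwidth / (active_params * bytes_per_param).
    """
    if tokens_per_second <= 0 or bytes_per_param <= 0:
        raise ValueError("tokens_per_second and bytes_per_param must be positive")
    return mem_bandwidth_bytes_per_s / (tokens_per_second * bytes_per_param)


# Hypothetical inputs: 50 tok/s observed, 1e12 bytes/s of aggregate
# memory bandwidth, 16-bit weights (2 bytes per parameter).
estimate = estimate_active_params(50.0, 1e12, 2.0)
print(f"~{estimate:.1e} active parameters")
```

		Note that for a sparse or mixture-of-experts model this only bounds the parameters touched per token, not the total parameter count.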