nbardy 3 hours ago

You can estimate the size of the served model from its tok/second.
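A rough sketch of how a tok/second estimate could work. It assumes single-stream decoding is memory-bandwidth bound, so every generated token has to read all active weights once. The function name, the bandwidth figure, and the bytes-per-parameter value are illustrative assumptions, not numbers from any real deployment. Batching, speculative decoding, KV-cache reads, and multi-GPU sharding are all ignored, so treat the result as an order-of-magnitude guess.

```python
def estimate_active_params(tokens_per_second: float,
                           memory_bandwidth_bytes_per_s: float,
                           bytes_per_param: float = 2.0) -> float:
    """Back-of-envelope estimate of active parameters per token.

    Assumption: decoding is memory-bandwidth bound, so
        tokens_per_second ~= bandwidth / (active_params * bytes_per_param)
    which rearranges to the expression returned below.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return memory_bandwidth_bytes_per_s / (tokens_per_second * bytes_per_param)


# Hypothetical inputs: 100 tok/s observed, 3 TB/s of memory bandwidth,
# 16-bit weights (2 bytes per parameter).
estimate = estimate_active_params(100, 3.0e12, 2.0)
print(f"~{estimate / 1e9:.0f}B active parameters")
```

With those made-up inputs the estimate lands in the tens of billions of active parameters, far below a trillion, which is the kind of gap the comment is pointing at.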

The trillions-of-parameters claim is about pretraining.

It’s most efficient in pretraining to train the biggest model possible. Each increase in parameter count gives you an increase in sample efficiency.

However, those models end up very sparse and incredibly distillable.

And it’s way too expensive and slow to serve models that size, so they are distilled down a lot.