ainch 4 days ago

Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
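
Roughly, the routing looks like this. Toy sketch with made-up sizes (not any real model's config): a learned router scores each token's hidden state, only the top-k experts actually run, and their outputs are mixed by the router weights.

    import numpy as np

    # Hypothetical toy sizes, not DeepSeek's actual config.
    D_MODEL = 128        # hidden size
    N_EXPERTS = 16       # experts in this MoE layer
    TOP_K = 2            # experts activated per token

    rng = np.random.default_rng(0)
    # Each expert is a small feed-forward block: D_MODEL -> 4*D_MODEL -> D_MODEL.
    experts = [
        (rng.standard_normal((D_MODEL, 4 * D_MODEL)) * 0.02,
         rng.standard_normal((4 * D_MODEL, D_MODEL)) * 0.02)
        for _ in range(N_EXPERTS)
    ]
    router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

    def moe_forward(x):
        """Route one token's hidden state to its top-k experts only."""
        logits = x @ router                            # (N_EXPERTS,) router scores
        top = np.argsort(logits)[-TOP_K:]              # indices of the active experts
        weights = np.exp(logits[top])
        weights /= weights.sum()                       # softmax over the selected experts
        out = np.zeros_like(x)
        for w, i in zip(weights, top):
            w1, w2 = experts[i]
            out += w * (np.maximum(x @ w1, 0.0) @ w2)  # only TOP_K expert FFNs execute
        return out

    y = moe_forward(rng.standard_normal(D_MODEL))
    print(f"active fraction of expert params per token: {TOP_K / N_EXPERTS:.3f}")

So per token you pay for TOP_K of the N_EXPERTS expert FFNs plus the shared parts (attention, router), which is where the cheap-inference claim comes from.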

pama 4 days ago | parent

> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
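
To make the distinction concrete: each token only fires the top-k routed experts in each MoE layer, but over a longer prompt the union of experts that get used grows quickly. A toy simulation, assuming roughly uniform routing (which load-balancing losses push toward, but real routers don't guarantee), with counts in the ballpark of DeepSeek-V3/R1's 256 routed experts and 8 active per token:

    import numpy as np

    N_EXPERTS = 256   # routed experts per MoE layer
    TOP_K = 8         # experts activated per token
    rng = np.random.default_rng(0)

    def experts_touched(n_tokens):
        """Union of experts activated across a sequence, assuming roughly uniform routing."""
        touched = set()
        for _ in range(n_tokens):
            touched.update(rng.choice(N_EXPERTS, size=TOP_K, replace=False))
        return len(touched)

    for t in (1, 16, 128, 1024):
        print(f"{t:5d} tokens -> {experts_touched(t):3d}/{N_EXPERTS} experts used in this layer")

After a few hundred tokens essentially every expert in the layer has been hit at least once, so all the weights still have to be resident to serve a query, even though each individual token only touches ~1/18th of them.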

ainch 3 days ago | parent

That's a good correction, thanks.