ainch 4 days ago

Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
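
Roughly, the routing looks like this. Toy sketch with made-up sizes (not any real model's config): a learned router scores each token's hidden state, only the top-k experts actually run, and their outputs are mixed by the router weights.

    import numpy as np

    # Hypothetical toy sizes, not DeepSeek's actual config.
    D_MODEL = 128        # hidden size
    N_EXPERTS = 16       # experts in this MoE layer
    TOP_K = 2            # experts activated per token

    rng = np.random.default_rng(0)
    # Each expert is a small feed-forward block: D_MODEL -> 4*D_MODEL -> D_MODEL.
    experts = [
        (rng.standard_normal((D_MODEL, 4 * D_MODEL)) * 0.02,
         rng.standard_normal((4 * D_MODEL, D_MODEL)) * 0.02)
        for _ in range(N_EXPERTS)
    ]
    router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

    def moe_forward(x):
        """Route one token's hidden state to its top-k experts only."""
        logits = x @ router                            # (N_EXPERTS,) router scores
        top = np.argsort(logits)[-TOP_K:]              # indices of the active experts
        weights = np.exp(logits[top])
        weights /= weights.sum()                       # softmax over the selected experts
        out = np.zeros_like(x)
        for w, i in zip(weights, top):
            w1, w2 = experts[i]
            out += w * (np.maximum(x @ w1, 0.0) @ w2)  # only TOP_K expert FFNs execute
        return out

    y = moe_forward(rng.standard_normal(D_MODEL))
    print(f"active fraction of expert params per token: {TOP_K / N_EXPERTS:.3f}")

So per token you pay for TOP_K of the N_EXPERTS expert FFNs plus the shared parts (attention, router), which is where the cheap-inference claim comes from.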

pama 4 days ago | parent

> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
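
To make the distinction concrete: each token only fires the top-k routed experts in each MoE layer, but over a longer prompt the union of experts that get used grows quickly. A toy simulation, assuming roughly uniform routing (which load-balancing losses push toward, but real routers don't guarantee), with counts in the ballpark of DeepSeek-V3/R1's 256 routed experts and 8 active per token:

    import numpy as np

    N_EXPERTS = 256   # routed experts per MoE layer
    TOP_K = 8         # experts activated per token
    rng = np.random.default_rng(0)

    def experts_touched(n_tokens):
        """Union of experts activated across a sequence, assuming roughly uniform routing."""
        touched = set()
        for _ in range(n_tokens):
            touched.update(rng.choice(N_EXPERTS, size=TOP_K, replace=False))
        return len(touched)

    for t in (1, 16, 128, 1024):
        print(f"{t:5d} tokens -> {experts_touched(t):3d}/{N_EXPERTS} experts used in this layer")

After a few hundred tokens essentially every expert in the layer has been hit at least once, so all the weights still have to be resident to serve a query, even though each individual token only touches ~1/18th of them.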

ainch 3 days ago | parent

That's a good correction, thanks.