onlyrealcuzzo 4 days ago

Isn't the whole point of the MoE architecture exactly this?

That you can individually train and improve smaller segments as necessary?

ainch 4 days ago | parent | next [-]

Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active experts' parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
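A rough sketch of why the active fraction stays small: a router scores every expert for each token, but only the top-k experts actually run. (Plain NumPy; all sizes and names here are illustrative, not DeepSeek's actual configuration.)

    import numpy as np

    n_experts, top_k, d = 64, 4, 512              # 4 of 64 experts active per token
    W_gate = np.random.randn(d, n_experts)        # router weights
    experts = [np.random.randn(d, d) for _ in range(n_experts)]  # one simplified FFN per expert

    def moe_layer(x):                             # x: (d,) hidden state of one token
        scores = x @ W_gate                       # one router logit per expert
        active = np.argsort(scores)[-top_k:]      # keep only the top-k experts
        weights = np.exp(scores[active])
        weights /= weights.sum()                  # softmax over the selected experts
        # only top_k of the n_experts weight matrices are touched for this token
        return sum(w * (x @ experts[i]) for w, i in zip(weights, active))

    print(f"fraction of experts active per token: {top_k / n_experts:.3f}")  # 0.062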

pama 4 days ago | parent [-]

> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
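A toy way to see the distinction: each token activates only top_k experts, but different tokens pick different ones, so the union of experts touched over a long query can cover most of the model. (Hypothetical numbers and a uniform-random router, which real routers are not.)

    import numpy as np

    n_experts, top_k, n_tokens = 256, 8, 500      # 8 of 256 routed experts per token
    rng = np.random.default_rng(0)

    seen = set()
    for _ in range(n_tokens):                     # route each token independently
        seen.update(rng.choice(n_experts, size=top_k, replace=False).tolist())

    print(f"per token:   {top_k / n_experts:.1%} of experts")
    print(f"whole query: {len(seen) / n_experts:.1%} of experts")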

ainch 3 days ago | parent [-]

That's a good correction, thanks.

idiotsecant 4 days ago | parent | prev [-]

I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.
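In practice that spreading-out is usually encouraged with an auxiliary load-balancing loss on the router, not with any domain labels. A minimal sketch of a Switch-Transformer-style balance term (illustrative, simplified):

    import numpy as np

    def load_balance_loss(router_probs, expert_ids, n_experts):
        # router_probs: (tokens, n_experts) softmax outputs of the router
        # expert_ids:   (tokens,) expert each token was actually sent to
        frac_tokens = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
        frac_probs = router_probs.mean(axis=0)
        # minimized when both distributions are uniform, i.e. no "best baker" emerges,
        # just experts that each take a similar share of the work
        return n_experts * float(np.dot(frac_tokens, frac_probs))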

viraptor 4 days ago | parent [-]

That's not entirely correct. Most MoEs right now are fully balanced, but there is the idea of a domain-expert MoE, where training benefits from fewer expert switches. https://arxiv.org/abs/2410.07490
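To illustrate what "fewer switches" could mean in routing terms (just a toy of the general idea, not the method from the linked paper): bias the router toward the expert that served the previous token, so runs of same-domain tokens tend to stay on one expert.

    import numpy as np

    def sticky_route(scores, prev_expert, bonus=1.0):
        # scores: (n_experts,) router logits for the current token
        biased = scores.copy()
        if prev_expert is not None:
            biased[prev_expert] += bonus          # staying on the same expert is cheaper
        return int(np.argmax(biased))             # top-1 routing

    rng = np.random.default_rng(0)
    prev, switches = None, 0
    for _ in range(100):
        cur = sticky_route(rng.normal(size=8), prev)
        switches += int(prev is not None and cur != prev)
        prev = cur
    print("expert switches over 100 tokens:", switches)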

idiotsecant 3 days ago | parent [-]

Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.