I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.

viraptor 4 days ago

That's not entirely correct. Most MoE models right now are fully balanced, but there is also the idea of a domain-expert MoE, where training favors fewer switches between experts. https://arxiv.org/abs/2410.07490
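
For anyone wondering what "fully balanced" means in practice: as far as I understand it, the router is usually trained with an auxiliary loss that penalizes sending most tokens to the same few experts. Here is a rough sketch, assuming top-1 routing and a Switch-Transformer-style loss. The function names and the toy logits are made up for illustration and are not taken from the linked paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token_logits):
    # Hypothetical helper: send each token to its highest-probability expert.
    probs = [softmax(row) for row in token_logits]
    choices = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, choices

def load_balance_loss(probs, choices, num_experts):
    # Assumed form: num_experts * sum_i f_i * P_i, where
    #   f_i = fraction of tokens routed to expert i
    #   P_i = mean router probability assigned to expert i
    n = len(choices)
    f = [sum(1 for c in choices if c == i) / n for i in range(num_experts)]
    p = [sum(row[i] for row in probs) / n for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy batch where every token prefers a different expert (balanced).
balanced_logits = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
# Toy batch where every token prefers expert 0 (collapsed).
collapsed_logits = [[5, 0, 0, 0]] * 4

balanced = load_balance_loss(*route_top1(balanced_logits), num_experts=4)
collapsed = load_balance_loss(*route_top1(collapsed_logits), num_experts=4)
print(balanced, collapsed)
```

With these toy inputs the balanced batch scores about 1.0 and the collapsed batch scores close to 4, so minimizing the term pushes the router toward spreading tokens evenly. Nothing in it cares which domain a token comes from, which is the contrast with the domain-expert setup.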

	idiotsecant 3 days ago
		Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.