onlyrealcuzzo 4 days ago

Isn't the whole point of the MoE architecture exactly this?

That you can individually train and improve smaller segments as necessary?

ainch 4 days ago | parent | next [-]

Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active experts' parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.
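A rough sketch of why the active fraction stays small: a router scores every expert for each token, but only the top-k experts actually run. (Plain NumPy; all sizes and names here are illustrative, not DeepSeek's actual configuration.)

    import numpy as np

    n_experts, top_k, d = 64, 4, 512              # 4 of 64 experts active per token
    W_gate = np.random.randn(d, n_experts)        # router weights
    experts = [np.random.randn(d, d) for _ in range(n_experts)]  # one simplified FFN per expert

    def moe_layer(x):                             # x: (d,) hidden state of one token
        scores = x @ W_gate                       # one router logit per expert
        active = np.argsort(scores)[-top_k:]      # keep only the top-k experts
        weights = np.exp(scores[active])
        weights /= weights.sum()                  # softmax over the selected experts
        # only top_k of the n_experts weight matrices are touched for this token
        return sum(w * (x @ experts[i]) for w, i in zip(weights, active))

    print(f"fraction of experts active per token: {top_k / n_experts:.3f}")  # 0.062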

pama 4 days ago | parent [-]

> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use a large fraction of them in a single query.
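A toy way to see the distinction: each token activates only top_k experts, but different tokens pick different ones, so the union of experts touched over a long query can cover most of the model. (Hypothetical numbers and a uniform-random router, which real routers are not.)

    import numpy as np

    n_experts, top_k, n_tokens = 256, 8, 500      # 8 of 256 routed experts per token
    rng = np.random.default_rng(0)

    seen = set()
    for _ in range(n_tokens):                     # route each token independently
        seen.update(rng.choice(n_experts, size=top_k, replace=False).tolist())

    print(f"per token:   {top_k / n_experts:.1%} of experts")
    print(f"whole query: {len(seen) / n_experts:.1%} of experts")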

ainch 3 days ago | parent [-]

That's a good correction, thanks.

idiotsecant 4 days ago | parent | prev [-]

I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.
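In practice that spreading-out is usually encouraged with an auxiliary load-balancing loss on the router, not with any domain labels. A minimal sketch of a Switch-Transformer-style balance term (illustrative, simplified):

    import numpy as np

    def load_balance_loss(router_probs, expert_ids, n_experts):
        # router_probs: (tokens, n_experts) softmax outputs of the router
        # expert_ids:   (tokens,) expert each token was actually sent to
        frac_tokens = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
        frac_probs = router_probs.mean(axis=0)
        # minimized when both distributions are uniform, i.e. no "best baker" emerges,
        # just experts that each take a similar share of the work
        return n_experts * float(np.dot(frac_tokens, frac_probs))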

viraptor 4 days ago | parent [-]

That's not entirely correct. Most MoEs right now are fully balanced, but there is the idea of a domain-expert MoE, where training benefits from fewer expert switches. https://arxiv.org/abs/2410.07490
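To illustrate what "fewer switches" could mean in routing terms (just a toy of the general idea, not the method from the linked paper): bias the router toward the expert that served the previous token, so runs of same-domain tokens tend to stay on one expert.

    import numpy as np

    def sticky_route(scores, prev_expert, bonus=1.0):
        # scores: (n_experts,) router logits for the current token
        biased = scores.copy()
        if prev_expert is not None:
            biased[prev_expert] += bonus          # staying on the same expert is cheaper
        return int(np.argmax(biased))             # top-1 routing

    rng = np.random.default_rng(0)
    prev, switches = None, 0
    for _ in range(100):
        cur = sticky_route(rng.normal(size=8), prev)
        switches += int(prev is not None and cur != prev)
        prev = cur
    print("expert switches over 100 tokens:", switches)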

idiotsecant 3 days ago | parent [-]

Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Bitter Lesson.