onlyrealcuzzo 4 days ago
Isn't the whole point of the MoE architecture exactly this: that you can individually train and improve smaller segments as necessary?
ainch 4 days ago
Generally you train all the experts simultaneously. The benefit of MoEs is that you get cheap inference, because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example, DeepSeek R1 (which is especially sparse) only activates about 1/18th of its total parameters per token.
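To make the "active parameters" point concrete, here's a minimal top-k gating sketch in PyTorch. This is illustrative only, not DeepSeek's actual routing code; the TopKMoE name, layer sizes, and k=2 choice are made up. The idea is just that each token is dispatched to k of the N experts, so only k/N of the expert weights run per token:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        # Hypothetical top-k MoE layer: a learned router picks k experts per
        # token, so only k/num_experts of the expert parameters are used in
        # each forward pass.
        def __init__(self, dim=64, num_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )

        def forward(self, x):                      # x: (tokens, dim)
            scores = self.router(x)                # (tokens, num_experts)
            weights, idx = torch.topk(scores, self.k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # renormalize over the chosen k
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    moe = TopKMoE()
    tokens = torch.randn(16, 64)
    print(moe(tokens).shape)  # torch.Size([16, 64]); only 2 of 8 experts ran per token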
| |||||||||||||||||
idiotsecant 4 days ago
I think it's the exact opposite - you don't specifically train each 'expert' to be a subject-matter expert at something. Each of the experts is a generalist that becomes better at portions of tasks in a distributed way. There is no 'best baker'; things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.
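In the common setups the router and experts are trained jointly, and the only explicit pressure is usually a load-balancing term that stops the router from collapsing onto a few favorite experts; whatever specialization shows up falls out of that. Rough sketch of a Switch-Transformer-style auxiliary loss (illustrative; the function name is made up):

    import torch
    import torch.nn.functional as F

    # Nothing here assigns a domain to an expert: the loss only encourages
    # the router to spread tokens evenly, and specialization emerges on its own.
    def load_balance_loss(router_logits, top1_idx, num_experts):
        # fraction of tokens actually dispatched to each expert
        dispatch_frac = F.one_hot(top1_idx, num_experts).float().mean(dim=0)
        # average router probability assigned to each expert
        prob_frac = F.softmax(router_logits, dim=-1).mean(dim=0)
        return num_experts * torch.sum(dispatch_frac * prob_frac)

    logits = torch.randn(32, 8)                 # 32 tokens, 8 experts
    top1 = logits.argmax(dim=-1)
    print(load_balance_loss(logits, top1, 8))   # ~1.0 when routing is balanced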
| |||||||||||||||||