agunapal 2 days ago
If you really think about why MoE came into existence, it's to save significant cost during training; I don't think there was any concrete evidence of performance gains for MoE over comparable dense models. Over the years, I believe all the new techniques employed in post-training have made the models better.
vessenes 2 days ago | parent
I think you mean inference compute? I believe all expert weights are updated in each backward pass during MoE training. The first benefit was getting a sort of structured pruning of weights through the mechanism of expert selection, so that the model didn't need to go through 'unnecessary' parts of the model for a given token. This then let inference use memory more efficiently in memory-constrained environments, where non-hot or less common experts could be put into slow RAM, or sometimes even streamed off storage. But I don't think it necessarily saved training cost; if it did, I'd be interested to learn how!
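A minimal sketch of the expert-selection (top-k routing) mechanism described above, in NumPy. The sizes, expert count, and top_k value are illustrative assumptions, not taken from any particular model:

    import numpy as np

    def moe_layer(x, gate_w, experts, top_k=2):
        """Route one token's hidden state x through the top_k highest-scoring experts.

        x       : (d_model,) hidden state for one token
        gate_w  : (d_model, n_experts) router/gating weights
        experts : list of (w_in, w_out) pairs, one small 2-layer MLP per expert
        """
        scores = x @ gate_w                          # router logits, one per expert
        top = np.argsort(scores)[-top_k:]            # indices of the selected experts
        weights = np.exp(scores[top])
        weights /= weights.sum()                     # softmax over selected experts only

        out = np.zeros_like(x)
        for w, idx in zip(weights, top):
            w_in, w_out = experts[idx]
            out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # only top_k experts run
        return out

    # Illustrative sizes: 8 experts stored, but only 2 touched per token.
    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts = 16, 64, 8
    gate_w = rng.normal(size=(d_model, n_experts))
    experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
               for _ in range(n_experts)]
    print(moe_layer(rng.normal(size=d_model), gate_w, experts).shape)

In the forward pass only the selected experts are evaluated, but in a full training batch every expert typically receives some tokens, which is why the per-step FLOP savings don't straightforwardly translate into skipping whole experts during training.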
zozbot234 2 days ago | parent
MoE models will have far more world knowledge than dense models with the same number of active parameters. MoE is a no-brainer if your inference setup is ultimately limited by compute or memory throughput - not total memory footprint - or alternatively if it has fast, high-bandwidth access to lower-tier storage to fetch cold model weights from on demand.
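To make the active-vs-total distinction concrete, a back-of-the-envelope comparison for a single MoE FFN layer; the sizes (8 experts, top-2 routing, roughly Mixtral-scale dimensions) are illustrative assumptions, not figures from any specific model:

    # Total vs active parameters for one MoE FFN layer (illustrative sizes only).
    d_model, d_ff = 4096, 14336
    n_experts, top_k = 8, 2

    ffn_params = 2 * d_model * d_ff      # up- and down-projection of one expert
    total = n_experts * ffn_params       # weights that must be stored in memory
    active = top_k * ffn_params          # weights actually multiplied per token

    print(f"total FFN params per layer:  {total/1e9:.2f}B")   # ~0.94B stored
    print(f"active FFN params per token: {active/1e9:.2f}B")  # ~0.23B computed

The footprint scales with n_experts while per-token compute and bandwidth scale only with top_k, which is why MoE wins when you're compute- or bandwidth-bound rather than capacity-bound.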