HarHarVeryFunny 6 hours ago
MoE is from Google (Noam Shazeer). MTP is from Meta. Another DeepSeek advance that the West is copying is DeepSeek Sparse Attention (DSA).
xgk 17 minutes ago
Mixture-of-Experts (MoE) was introduced in the 1990s [1, 2]; see also [3, 4]. The idea was that MoE scales up model capacity while adding only a small computational overhead. MoEs did not become viable for high-performance applications until sparse routing was integrated with modern deep networks, which large-scale distributed computation made possible. The breakthrough came with the development of sparsely gated networks [5], which showed that it is possible to maintain model accuracy while activating only a small fraction of a large network's parameters during both training and inference.

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)
[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)
[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)
[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)
[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)
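To make "activating only a small fraction" concrete, here is a minimal pure-Python sketch of top-k routing for a single input. It is an illustration under stated assumptions, not the implementation from [5]: the names `top_k_gate`, `moe_forward`, and `make_expert` are hypothetical, the experts are toy scaling functions, and the noise term and load-balancing losses used in practice are left out.

```python
import math
import random


def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax-normalise over those k only.

    Returns {expert_index: weight}; every other expert implicitly gets weight 0.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, gate_weights, experts, k=2):
    """Route x to its top-k experts and return the gate-weighted sum of their outputs.

    gate_weights holds one weight vector per expert (an assumed linear router);
    only the k selected experts are ever evaluated.
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in gate_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)
        out = [o + g * y_j for o, y_j in zip(out, y)]
    return out, gates


def make_expert(scale):
    """Toy stand-in for an expert network: multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]


if __name__ == "__main__":
    random.seed(0)
    dim, n_experts = 4, 8
    gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
    experts = [make_expert(s) for s in range(n_experts)]
    out, gates = moe_forward([1.0, 0.5, -0.5, 2.0], gate_weights, experts, k=2)
    print("selected experts and weights:", gates)
    print("output:", out)
```

With 8 experts and k=2, only a quarter of the expert parameters are touched per input, while the total parameter count grows with the number of experts. That is the capacity-versus-compute trade-off described above.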