thesz 6 hours ago
https://en.wikipedia.org/wiki/Mixture_of_experts#Sparsely-ga...

"The sparsely-gated MoE layer,[21] published by researchers from Google Brain, uses feedforward networks as experts, and linear-softmax gating. Similar to the previously proposed hard MoE, they achieve sparsity by a weighted sum of only the top-k experts, instead of the weighted sum of all of them."

Note the "top-k experts": in the case of some DeepSeek models, k=1.
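To make the quoted "weighted sum of only the top-k experts" concrete, here is a minimal sketch in plain Python. It is a toy, not any particular model's implementation: the names `top_k_gate` and `moe_forward` are made up for illustration, the experts are simple scalar functions rather than feedforward networks, and the gate logits are passed in directly instead of being computed by a learned linear layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax over just those.

    Returns {expert_index: weight}; every other expert gets no weight
    and is never evaluated.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_forward(x, gate_logits, experts, k):
    """Weighted sum of the outputs of only the top-k experts."""
    gates = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Hypothetical toy experts standing in for feedforward networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x]
gate_logits = [1.0, 3.0, 2.0]  # assumed output of a linear gating layer

y_top1 = moe_forward(3.0, gate_logits, experts, k=1)  # only expert 1 runs
y_top2 = moe_forward(3.0, gate_logits, experts, k=2)  # experts 1 and 2 run
```

With k=1 the single selected expert gets weight 1, so the layer reduces to routing the input to one expert; larger k blends several experts while still skipping the rest.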