Remix clone Hacker News

	mcrutcher 5 days ago
		MoE models are pretty poorly named, since all the "experts" are "the same": identically shaped sub-networks that differ only in their learned weights. They're probably better described as "sparse activation" models. MoE implies some sort of "heterogeneous experts" that a "thalamus router" is trained to use, but that's not how they work.
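
		To make the "sparse activation" point concrete, here is a rough sketch of a top-k routed layer in plain Python. It is a toy under stated assumptions, not any particular model's implementation: the names (`make_expert`, `moe_forward`), the sizes, and the random weights are all made up for illustration, and the gating shown (softmax over only the selected logits) is just one common variant. Every expert is built by the same function, so they differ only in weights, and each token runs through only `k` of them.

```python
import math
import random


def make_expert(rng, d_in, d_hidden):
    """Build one expert. All experts use this same two-layer MLP shape;
    only the (here random, hypothetical) weights differ."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
    return w1, w2


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def run_expert(expert, x):
    """Apply one expert: linear, ReLU, linear."""
    w1, w2 = expert
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def moe_forward(x, router_w, experts, k=2):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = matvec(router_w, x)  # one score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]

    # Assumed gating variant: softmax over the selected logits only.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]

    # Only the chosen experts are evaluated; the rest are skipped entirely.
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = run_expert(experts[i], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top, gates


rng = random.Random(0)
D_IN, D_HIDDEN, N_EXPERTS = 4, 8, 8
experts = [make_expert(rng, D_IN, D_HIDDEN) for _ in range(N_EXPERTS)]
router_w = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_IN)]

output, chosen, gates = moe_forward(token, router_w, experts, k=2)
print(f"ran {len(chosen)} of {N_EXPERTS} experts: {chosen}")
```

		Nothing in the sketch gives any expert a distinct role up front; whatever specialization exists would come from training the router and the experts together.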