schipperai 4 hours ago
With most OSS releases being MoEs, and modern GPUs optimized for MoEs, can somebody with knowledge of the topic explain or speculate why Mistral might have opted for a dense model?
ac29 4 hours ago
Modern GPUs aren't optimized for MoEs though? The advantage of a dense model like this Mistral one is that it is as smart as a much larger MoE model, so it can fit on fewer GPUs. The tradeoff is that it is much slower, since it has to read 100% of its weights for every token; MoE models typically only read about a tenth (though sparsity levels vary).
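A back-of-the-envelope sketch of that bandwidth argument, using made-up parameter counts (the model sizes and the 10% active fraction below are illustrative assumptions, not real model specs):

```python
# Back-of-the-envelope comparison of weight bytes streamed per generated token.
# Decoding is typically memory-bandwidth bound: each token requires reading
# every *active* parameter from memory once. All numbers here are hypothetical.

def bytes_read_per_token(total_params, active_fraction, bytes_per_param=2):
    """Weight bytes read from memory to generate one token (bf16 = 2 bytes)."""
    return total_params * active_fraction * bytes_per_param

dense_params = 120e9         # hypothetical dense model
moe_params = 400e9           # hypothetical larger MoE
moe_active_fraction = 0.1    # roughly a tenth of weights active per token

dense_bytes = bytes_read_per_token(dense_params, 1.0)  # dense reads 100%
moe_bytes = bytes_read_per_token(moe_params, moe_active_fraction)

print(f"dense: {dense_bytes / 1e9:.0f} GB/token")  # 240 GB/token
print(f"MoE:   {moe_bytes / 1e9:.0f} GB/token")    # 80 GB/token
```

So even though the hypothetical MoE holds more than 3x the total weights (and needs that much more GPU memory to hold them), it streams only a third as many bytes per token, which is why it decodes faster at similar quality.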