Remix clone Hacker News

	▲	htrp 4 hours ago
		why 27b vs 35b? Is MoE that much worse for coding?
	▲	amarshall an hour ago
		You can take the geometric mean of an MoE's total and active parameter counts to get the approximate dense parameter count of equivalent quality. So sqrt(35*10)≈18.7, i.e. roughly an 18.7B dense model. The trade-off of MoE is that it is worse but faster for the same total size.
	▲	electronsoup 3 hours ago
		Yeah, MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even with CPU RAM offload. Dense models really need to fit 100% in VRAM.
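
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the geometric-mean rule of thumb from amarshall's comment. The function name `dense_equivalent_params` is made up for illustration, and the rule itself is only a rough community heuristic, not a measured result.

```python
import math


def dense_equivalent_params(total_b: float, active_b: float) -> float:
    """Rough dense-equivalent size (in billions of parameters) of an MoE model.

    Uses the geometric mean of total and active parameter counts.
    This is a heuristic only; real quality depends on training and data.
    """
    if total_b <= 0 or active_b <= 0:
        raise ValueError("parameter counts must be positive")
    if active_b > total_b:
        raise ValueError("active parameters cannot exceed total parameters")
    return math.sqrt(total_b * active_b)


# The example from the thread: 35B total, 10B active.
estimate = dense_equivalent_params(35, 10)
print(f"~{estimate:.1f}B dense-equivalent")  # ~18.7B dense-equivalent
```

By this estimate the 35B MoE lands below a 27B dense model, which matches the "worse but faster for the same total size" trade-off described above.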