Remix clone Hacker News

	alecco 4 hours ago
		Yeah, it's not a great apples-to-apples comparison. But I think the point stands: there's MoE, a myriad of complex attention approaches, shared layers, you name it. Making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.