Remix clone Hacker News

	zozbot234 5 hours ago
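		KV cache size correlates with the attention parameters, which are a subset of the active parameters. The experts in an MoE model only replace the feed-forward blocks, so they add to the total parameter count without adding anything to the KV cache. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count.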
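
To put rough numbers on this, here is a minimal sketch of the usual KV cache estimate: one K and one V tensor per layer, each of shape (KV heads x head dim) per token. The function name and both configs below are made up for illustration and do not describe any real model; the only point is that expert count never appears in the formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard (grouped-query) attention.

    The leading 2 accounts for storing both K and V in every layer.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configs, assumed to have the same TOTAL parameter count.
# The MoE model spends its parameters on experts in the FFN blocks,
# so it gets by with fewer attention layers.
dense = dict(n_layers=80, n_kv_heads=8, head_dim=128)
moe = dict(n_layers=32, n_kv_heads=8, head_dim=128)

seq_len = 4096
dense_bytes = kv_cache_bytes(seq_len=seq_len, **dense)
moe_bytes = kv_cache_bytes(seq_len=seq_len, **moe)

print(f"dense: {dense_bytes / 2**30:.2f} GiB")  # 1.25 GiB
print(f"moe:   {moe_bytes / 2**30:.2f} GiB")    # 0.50 GiB
```

The gap here comes entirely from the assumed layer counts, so treat the ratio as illustrative. Architectures that compress or share the KV cache would need a different formula.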