How large is the KV cache?
At 0.1 GB per full-attention layer, and given that "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention," only the 15 full-attention layers contribute to the KV cache; the GatedDeltaNet layers keep a fixed-size recurrent state instead of a cache that grows with sequence length. So: 15 × 0.1 GB = 1.5 GB.
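The arithmetic above can be sketched as a few lines of Python. The 0.1 GB-per-layer figure is taken as given, and the layer counts come from the quoted model description; variable names are illustrative, not from any real codebase.

```python
# KV-cache size estimate: only full-attention layers cache keys/values.
FULL_ATTN_LAYERS = 15       # standard full-attention layers (grow a KV cache)
LINEAR_ATTN_LAYERS = 45     # GatedDeltaNet layers (fixed-size state, no KV cache)
GB_PER_FULL_ATTN_LAYER = 0.1  # given per-layer KV cache size

kv_cache_gb = FULL_ATTN_LAYERS * GB_PER_FULL_ATTN_LAYER
print(f"KV cache: {kv_cache_gb:.1f} GB")  # → KV cache: 1.5 GB
```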