lostmsu 8 hours ago

How large is the KV cache?

xbar 8 hours ago | parent [-]

About 0.1 GB per full-attention layer, and "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention." Only the 15 full-attention layers keep a growing KV cache, so 15 × 0.1 GB ≈ 1.5 GB.
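The arithmetic can be sanity-checked with a quick sketch. The 0.1 GB per-layer figure is taken from the comment above; the head counts and head dimension in the generic formula are illustrative placeholders, not the model's actual dimensions:

```python
# Back-of-envelope check of the estimate above. Only the standard
# full-attention layers accumulate a per-token KV cache; the 45
# GatedDeltaNet (linear attention) layers carry constant-size state.
full_attn_layers = 15
gb_per_layer = 0.1  # per-layer figure quoted in the comment

kv_cache_gb = full_attn_layers * gb_per_layer
print(f"{kv_cache_gb:.1f} GB")  # 1.5 GB

# Generic per-layer KV cache size, with illustrative (hypothetical)
# defaults: K and V each store n_kv_heads * head_dim values per token.
def kv_bytes_per_layer(seq_len, n_kv_heads=2, head_dim=256, dtype_bytes=2):
    return 2 * n_kv_heads * head_dim * dtype_bytes * seq_len
```

With these placeholder dimensions each cached token costs 2 KB per full-attention layer, which is how a fixed per-layer GB figure like 0.1 GB implies a particular context length.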