Havoc 2 days ago

Don’t think KV cache size correlates with dense vs. MoE

zozbot234 2 days ago | parent [-]

KV size correlates with the attention parameters, which are a subset of the active parameters. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count.
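
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes standard multi-head or grouped-query attention with a full-length cache and fp16 entries; schemes like sliding-window or latent attention change the formula. The function name and both configs are made up for illustration and are not taken from any real model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size in bytes for plain MHA/GQA attention.

    The leading 2 covers keys plus values. Note that the expert count
    and the FFN width do not appear anywhere in this formula.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical configs with similar *total* parameter counts (assumed, not real models).
# The dense model spends its parameters on depth and width; the MoE model
# spends most of them on experts, so its attention stack is smaller.
dense = dict(n_layers=80, n_kv_heads=8, head_dim=128)
moe = dict(n_layers=32, n_kv_heads=8, head_dim=128)

dense_per_token = kv_cache_bytes(seq_len=1, **dense)
moe_per_token = kv_cache_bytes(seq_len=1, **moe)

print(dense_per_token // 1024, "KiB per token (dense)")  # 320 KiB
print(moe_per_token // 1024, "KiB per token (MoE)")      # 128 KiB
```

The cache only sees layer count, KV head count, and head dimension, so whether a given MoE model actually comes out smaller depends on how its attention stack is sized, not on the expert count itself.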