| ▲ | Havoc 2 days ago | |
I don’t think KV cache size correlates with dense vs. MoE. | ||
| ▲ | zozbot234 2 days ago | parent [-] | |
KV cache size scales with the attention parameters, which are a subset of the active parameters. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count. | ||
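A rough sketch of the arithmetic behind this, assuming standard multi-head or grouped-query attention (not MLA or sliding-window variants) and fp16 cache entries. The function name and both configs below are made up for illustration and do not describe any real model. Note that the number of experts never appears in the formula: only the attention shape does, so an MoE model that spends most of its total parameters on experts ends up with a smaller attention stack than a dense model of the same total size.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    Assumes standard MHA/GQA attention, where every layer stores one key
    and one value vector per KV head per token.
    """
    # The leading 2 accounts for keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configs with similar *total* parameter counts (assumption):
# the dense model is deep, the MoE model is shallower with many experts.
dense = dict(n_layers=80, n_kv_heads=8, head_dim=128)
moe = dict(n_layers=32, n_kv_heads=8, head_dim=128)

dense_bytes = kv_cache_bytes(**dense, seq_len=8192)
moe_bytes = kv_cache_bytes(**moe, seq_len=8192)

print(f"dense: {dense_bytes / 2**30:.2f} GiB")
print(f"moe:   {moe_bytes / 2**30:.2f} GiB")
```

With these made-up shapes the gap comes entirely from layer count, which is consistent with both comments: the cache does not depend on dense vs. MoE as such, but MoE models of a given total size tend to have fewer attention parameters.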