2001zhaozhao 6 hours ago

This is 128B dense, though. The KV cache at long context is going to be massive.

Havoc 5 hours ago | parent | next [-]

I don't think KV size correlates with dense vs. MoE.

zozbot234 5 hours ago | parent [-]

KV size correlates with attention parameters which are a subset of active parameters. So a typical MoE model will have way lower KV size than a dense model of equal total parameter count.
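A rough sketch of the arithmetic behind this, assuming standard grouped-query attention with a 16-bit cache. The function name and both configs are hypothetical and not taken from any real model; the smaller layer count for the MoE is an assumption standing in for "fewer attention parameters at equal total size".

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Approximate KV cache size for standard (grouped-query) attention."""
    # Two cached tensors (K and V) per layer, each shaped
    # [batch, n_kv_heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configs with the same total parameter count in mind.
# Note the formula has no term for expert count or FFN width, so the
# experts that make up most of an MoE's parameters add nothing here.
dense = dict(n_layers=88, n_kv_heads=8, head_dim=128)
moe = dict(n_layers=48, n_kv_heads=8, head_dim=128)

seq_len = 131072  # 128K-token context
dense_gib = kv_cache_bytes(seq_len=seq_len, **dense) / 2**30
moe_gib = kv_cache_bytes(seq_len=seq_len, **moe) / 2**30

print(f"dense: {dense_gib:.1f} GiB, MoE: {moe_gib:.1f} GiB")
# prints "dense: 44.0 GiB, MoE: 24.0 GiB"
```

Only the attention shape (layers, KV heads, head dimension) and the context length enter the result, which is why total parameter count alone is a poor predictor of cache size.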

syntaxing 4 hours ago | parent | prev [-]

With TurboQuant, you would reduce it by over 6x.