Havoc 8 hours ago
> The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only. Does this make sense? I'd have thought the KV is guaranteed to be used 100% of the time, while in, say, a MoE the same can't be said of the weights. Though I suppose if you're shooting for huge context, having that allocation go into RAM makes sense, especially when it's allocated but not yet used.
alexeldeib 4 hours ago | parent
KV cache is, well, a cache: it can fill up and trigger eviction. You need enough space to execute at least one forward pass of one request at your context length. KV cache hits reduce TTFT by letting you skip prefill, but you don't get to skip decode. MoE is somewhat related in that its active-weight footprint is lower than a dense model of the same total parameter size, but I think your mental model is a bit off.
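To put numbers on "enough space for one forward pass at your context length": a rough back-of-the-envelope KV-cache sizing sketch. The model dimensions below (32 layers, 8 KV heads with GQA, head_dim 128, fp16) are illustrative assumptions, not figures from this thread.

```python
# Rough KV-cache sizing sketch. Per token, each layer stores one K and one V
# vector for every KV head; multiply by dtype width and sequence length.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 accounts for the separate K and V tensors
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical Llama-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128
one_request = kv_cache_bytes(32, 8, 128, 128_000)
print(f"{one_request / 2**30:.1f} GiB")  # → 15.6 GiB for a single 128k request
```

At that scale, a single max-context request needs ~15.6 GiB of KV storage before any batching, which is why offloading the KV pool to DDR4 can free a lot of VRAM even when most of it sits allocated but not yet filled.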