Remix clone Hacker News


	▲	Jasssss 4 hours ago
		The plan command is clever. How do you handle the VRAM estimation for models with sliding window attention vs full context? Something like Mistral at 32k context uses way less KV cache than Llama at the same context length, but from the README it looks like the estimation is based on a fixed context size. Does it account for that?
	▲	andyyyy64 2 hours ago
		Good catch, that's a real gap. The KV estimate is GQA/MQA-aware (per-model head config) but currently assumes dense full-context attention. It does not model sliding-window or chunked attention, so for SWA models like Mistral or Gemma at long context it over-estimates KV. The error is conservative: it tells you a model needs more than it does, not less, so it won't push you into an OOM. It's still wrong, though. I'll open a tracking issue with per-architecture window sizes; if you have a reference for the exact SWA configs you care about, it'll speed up the fix. This is the kind of report I posted for.
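To make the size of the gap concrete, here is a minimal sketch of a GQA-aware KV estimate with an optional sliding-window cap. It is not the tool's actual code: `AttnConfig`, `kv_cache_bytes`, and the field names are hypothetical, the shapes are illustrative values in the range of a 7B/8B model, and it assumes the runtime keeps only the last `sliding_window` tokens per layer (a rolling buffer) and that every layer uses the same window. Models that interleave local and global layers would need a per-layer sum instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttnConfig:
    """Hypothetical per-model attention config (names are assumptions)."""
    n_layers: int
    n_kv_heads: int                        # KV heads, not query heads (GQA/MQA-aware)
    head_dim: int
    sliding_window: Optional[int] = None   # None means dense full-context attention
    bytes_per_elem: int = 2                # fp16/bf16 cache entries


def kv_cache_bytes(cfg: AttnConfig, context_len: int, batch_size: int = 1) -> int:
    """Estimate KV cache size in bytes for one model at a given context length."""
    # With a sliding window, at most `sliding_window` tokens are cached per layer
    # (assuming a rolling-buffer cache); otherwise every token is cached.
    if cfg.sliding_window is None:
        cached_tokens = context_len
    else:
        cached_tokens = min(context_len, cfg.sliding_window)
    # Factor of 2 covers the separate K and V tensors.
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_elem
    return per_token * cached_tokens * batch_size


# Illustrative shapes only: 32 layers, 8 KV heads, head_dim 128.
dense = AttnConfig(n_layers=32, n_kv_heads=8, head_dim=128)
swa = AttnConfig(n_layers=32, n_kv_heads=8, head_dim=128, sliding_window=4096)

GIB = 1024 ** 3
print(f"dense @ 32k: {kv_cache_bytes(dense, 32768) / GIB:.2f} GiB")
print(f"SWA   @ 32k: {kv_cache_bytes(swa, 32768) / GIB:.2f} GiB")
```

Under these assumptions the dense estimate at 32k context is 8x the windowed one, which is the over-estimate described above. A runtime that preallocates the full context regardless of the window would not see that saving, so the cap should only apply where the backend actually uses a rolling cache.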