Remix clone Hacker News

	▲	edg5000 7 hours ago
		So limiting the max context length also reduces VRAM needs a bit? If the KV cache is 20% of total memory, capping context at 1/10th of its length would shrink the cache to 2% of the original total, an 18% reduction in total memory.
	▲	valine 7 hours ago
		Yup, exactly. In principle it helps on both fronts: it speeds up inference by reducing memory bandwidth usage, and it shrinks the memory footprint of your KV cache.
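
The arithmetic in this exchange can be sketched in a few lines. This is only a back-of-the-envelope check, assuming the KV cache grows linearly with context length and that everything else in memory (weights, activations) stays the same; the function name and parameters are made up for illustration.

```python
def total_memory_reduction(cache_fraction: float, context_scale: float) -> float:
    """Fraction of total memory saved when the context limit is scaled down.

    cache_fraction: share of total memory used by the KV cache (0.20 = 20%).
    context_scale:  new context limit relative to the old one (0.10 = 1/10th).

    Assumption: the KV cache scales linearly with context length and the
    rest of the memory footprint is unaffected.
    """
    if not 0.0 <= cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be between 0 and 1")
    if not 0.0 <= context_scale <= 1.0:
        raise ValueError("context_scale must be between 0 and 1")
    # Only the removed part of the cache counts as savings.
    return cache_fraction * (1.0 - context_scale)


# The numbers from the comment above: 20% cache, context capped at 1/10th.
print(f"{total_memory_reduction(0.20, 0.10):.0%}")  # prints 18%
```

The savings are bounded by the cache's share of memory, so even dropping the context to zero could not save more than the 20% the cache occupies in this example.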