| ▲ | alexellisuk 7 hours ago | |
Is this going to need 1x or 2x of those RTX PRO 6000s to allow for a decent KV cache at an active context length of 64k-100k? It's one thing to run the model with no context, but coding agents build it up close to the maximum, and in my experience that slows generation down massively. | ||
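For anyone who wants to sanity-check that, here is a rough back-of-envelope sketch of how the KV cache grows with context length and cache dtype. The layer/head/dim numbers below are placeholder assumptions, not this model's actual config, and real allocators add overhead on top:

    # Rough KV-cache sizing sketch. Model dimensions are hypothetical
    # placeholders -- substitute the real values from the GGUF metadata
    # or model card before trusting the numbers.
    N_LAYERS   = 48    # transformer blocks (assumption)
    N_KV_HEADS = 8     # KV heads with GQA (assumption)
    HEAD_DIM   = 128   # per-head dimension (assumption)

    # Approximate bytes per cached element, including GGML block scales:
    # f16 = 2, q8_0 = 34/32, q4_0 = 18/32.
    BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

    def kv_cache_gib(ctx_len: int, cache_type: str = "f16") -> float:
        """Size of the K and V caches for the full context, in GiB."""
        elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_len  # K and V
        return elems * BYTES_PER_ELEM[cache_type] / 2**30

    for ctx in (65_536, 102_400):
        for ct in ("f16", "q8_0", "q4_0"):
            print(f"ctx={ctx:>7}  {ct:>5}: {kv_cache_gib(ctx, ct):6.1f} GiB")

With these placeholder dims the cache scales linearly with context, which is why a 64k-100k agent session costs so much more VRAM than the bare weights.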
| ▲ | redrove 4 hours ago | parent | next [-] | |
I have a 3090 and a 4090 and it all fits in VRAM with Q4_0 and a quantized KV cache at 96k ctx: 1400 tok/s prompt processing, 80 tok/s generation. | ||
| ▲ | segmondy 6 hours ago | parent | prev [-] | |
One 6000 should be fine; a Q6_K_XL GGUF will be almost on par with the raw weights and should still leave room for 128k-256k of context. | ||
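The same arithmetic can be extended to a weights-plus-cache check against a single 96 GB card. The parameter count, bits-per-weight, and model dims here are rough placeholder assumptions (Q6_K averages roughly 6.56 bits per weight), not measured numbers for this model:

    # Rough "does it fit in 96 GB" check: quantized weights + KV cache.
    # All model-specific numbers are placeholder assumptions -- swap in
    # the real GGUF file size and config for a meaningful answer.
    PARAMS_B        = 30.0   # parameters in billions (assumption)
    BITS_PER_WEIGHT = 6.56   # roughly what Q6_K-style quants average
    GPU_VRAM_GIB    = 96.0   # RTX PRO 6000

    N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128  # placeholders, as above

    def weights_gib() -> float:
        return PARAMS_B * 1e9 * BITS_PER_WEIGHT / 8 / 2**30

    def kv_gib(ctx_len: int, bytes_per_elem: float = 2.0) -> float:
        return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_len * bytes_per_elem / 2**30

    for ctx in (131_072, 262_144):
        total = weights_gib() + kv_gib(ctx)
        verdict = "fits" if total < GPU_VRAM_GIB else "does not fit"
        print(f"ctx={ctx:>7}: weights {weights_gib():.1f} GiB "
              f"+ KV {kv_gib(ctx):.1f} GiB = {total:.1f} GiB ({verdict})")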