kaoD (8 hours ago):

> The model weights stay resident in VRAM permanently so there's no loading/unloading per request.

Yes, I was thinking about context buffers, which I assume are not small in large models. Those have to be loaded into VRAM too, right? If I keep sending large context buffers, will that hog the batches?
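For a rough sense of scale, here is a back-of-envelope KV-cache estimate (a sketch only; the model dimensions below are assumptions for a hypothetical 70B-class model with grouped-query attention, not numbers from this thread). The per-request context buffer grows linearly with context length:

```python
# Back-of-envelope KV-cache size for one request.
# kv_bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * seq_len

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 80,     # assumed: 70B-class model
                   num_kv_heads: int = 8,    # assumed: grouped-query attention
                   head_dim: int = 128,      # assumed
                   bytes_per_elem: int = 2   # fp16 / bf16
                   ) -> int:
    """Bytes of KV cache a single request pins in VRAM at a given context length."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

if __name__ == "__main__":
    for tokens in (8_000, 32_000, 128_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```

Under those assumed dimensions an 8k-token request holds roughly 2-3 GiB for as long as it sits in the batch, and a 128k-token request holds tens of GiB, which is why very long contexts can crowd out other requests.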
jrandolf (7 hours ago):
Not if you are the only one using it. We have rate limits to prevent this in case you, idk, share your key with 1000 people lol.
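A minimal sketch of the kind of per-key rate limit being described (a token bucket is my assumption here; nothing below is the provider's actual implementation):

```python
import time
from collections import defaultdict

class PerKeyRateLimiter:
    """Token-bucket limiter keyed by API key: a key burning through its
    bucket gets throttled instead of hogging batch slots."""

    def __init__(self, rate_per_sec: float = 5.0, burst: float = 20.0):
        self.rate = rate_per_sec          # tokens refilled per second
        self.burst = burst                # max tokens a key can accumulate
        self.tokens = defaultdict(lambda: burst)
        self.last_seen = defaultdict(time.monotonic)

    def allow(self, api_key: str, cost: float = 1.0) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen[api_key]
        self.last_seen[api_key] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[api_key] = min(self.burst,
                                   self.tokens[api_key] + elapsed * self.rate)
        if self.tokens[api_key] >= cost:
            self.tokens[api_key] -= cost
            return True
        return False

limiter = PerKeyRateLimiter()
print(limiter.allow("key-shared-by-1000-people"))  # True until the bucket drains
```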