Remix clone Hacker News


	▲	lelanthran 3 hours ago
		> any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token request.

		So don't use it for large requests. It's ideal when you just want to categorise things, for example "does this task need a shell?" or "bucket this email into one of: help request, bill due, or personal comms".
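		As a rough check on the quoted figures, here is a minimal sketch of the arithmetic. It assumes a steady rate applied to all 86k tokens; the helper name `processing_hours` is made up for illustration.

```python
def processing_hours(tokens: int, tokens_per_second: float) -> float:
    """Hours needed to process `tokens` at a steady rate (an assumed simplification)."""
    return tokens / tokens_per_second / 3600


# The two ends of the quoted 1-5 t/s range, for one 86k-token request.
for rate in (1, 5):
    print(f"{rate} t/s: {processing_hours(86_000, rate):.1f} h")
# prints "1 t/s: 23.9 h" and "5 t/s: 4.8 h"
```

		So "a whole day" matches the 1 t/s end of the range; at 5 t/s the same request takes closer to five hours.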
	▲	zozbot234 2 hours ago
		The best use is actually for a layer that "almost fits" into VRAM, such that automated offloading to system RAM will be rare enough that it doesn't impact performance.