Remix clone Hacker News

	▲	ethan_smith 4 days ago
		Great points about training optimizations. For inference, quantization gives similarly dramatic memory reductions: INT8 roughly halves weight memory relative to FP16 and INT4 cuts it to about a quarter (about 4x and 8x relative to FP32), which lets much larger models fit on consumer GPUs.
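For anyone who wants rough numbers, here is a back-of-the-envelope sketch of the weight-memory arithmetic. It assumes a hypothetical 7B-parameter model and counts only weight storage, so KV cache, activations, and the per-group scale/zero-point overhead that real quantization formats carry are all ignored. The function names are made up for illustration and don't come from any library.

```python
# Rough weight-only memory estimate (assumption: ignores KV cache,
# activations, and quantization metadata such as scales and zero points).
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GiB at the given precision."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 2**30


def reduction_vs_fp16(dtype: str) -> float:
    """How many times smaller the weights are compared to FP16."""
    return BITS_PER_WEIGHT["fp16"] / BITS_PER_WEIGHT[dtype]


if __name__ == "__main__":
    params = 7e9  # hypothetical 7B-parameter model
    for dtype in ("fp16", "int8", "int4"):
        size = weight_memory_gib(params, dtype)
        ratio = reduction_vs_fp16(dtype)
        print(f"{dtype}: {size:.1f} GiB ({ratio:.0f}x smaller than fp16)")
```

In practice the savings land a bit below these ideal ratios, since quantized formats store extra metadata and some layers are often kept at higher precision.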