yjftsjthsd-h, 8 months ago:
> Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 1.9-38.8x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.3-13.17x longer context lengths than uncompressed models.

The context length alone probably makes it worthwhile even if your models fit in memory, but I'm curious whether it improves tokens/sec even when everything runs on the GPU, since in my (very amateur) understanding LLMs tend to be constrained by memory bandwidth.
brigade, 8 months ago:
It does not; the decompression is memory-to-memory, one tensor at a time, so it's strictly overhead. They claim less than 200 GB/s on an A100, and their benchmarks suggest it's somewhere between 1.5-4x slower at batch size 1, depending on GPU and model. That overhead mostly disappears at large enough batch sizes. Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement, but the A100's raw memory bandwidth is 1.6 TB/s.
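A rough roofline-style sketch of that batch-size-1 penalty (illustrative numbers only: the 14 GB model size is hypothetical, the bandwidth figures are the ones quoted in this thread, and real kernels overlap decompression with compute, so the measured 1.5-4x slowdown is milder than this naive serial estimate):

    # Naive serial estimate of batch-size-1 token latency, with and without
    # on-the-fly decompression. All numbers are illustrative: 14 GB stands in
    # for a ~7B-parameter BF16 model; bandwidth figures are from this thread.
    model_bytes = 14e9     # hypothetical BF16 model, bytes
    hbm_bw      = 1.6e12   # A100 raw HBM bandwidth, bytes/s
    decomp_bw   = 200e9    # claimed DFloat11 decompression rate, bytes/s
    ratio       = 0.70     # compressed size / original size

    # Uncompressed: each generated token streams the full weights from HBM once.
    t_plain = model_bytes / hbm_bw                 # ~8.8 ms/token
    # Compressed: every weight byte passes through the ~200 GB/s decompressor.
    t_dfloat = (model_bytes * ratio) / decomp_bw   # ~49 ms/token

    print(f"{1 / t_plain:.0f} tok/s uncompressed, "
          f"{1 / t_dfloat:.0f} tok/s compressed, "
          f"{t_dfloat / t_plain:.1f}x slower")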
philjohn, 8 months ago:
My mental model says it might, much like DoubleSpace in DOS slightly sped up loading data from slow hard drives.
hnuser123456, 8 months ago:
If inference is purely memory-bandwidth bound and decompression is free, a model at 70% of the size runs at 1/0.7 ≈ 1.43x the speed.
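For that 1.43x to actually materialize, though, the decompressor would have to keep pace with a proportional share of raw HBM bandwidth; a minimal sketch using the A100 figures quoted earlier in the thread:

    # Break-even decompression rate for the 1/0.7 speedup to materialize:
    # the compressed path must deliver useful weight bytes at least as fast
    # as HBM delivers uncompressed ones.
    hbm_bw = 1.6e12   # bytes/s, A100 raw memory bandwidth
    ratio  = 0.70     # compressed size / original size

    breakeven = ratio * hbm_bw
    print(f"break-even decompression rate: {breakeven / 1e12:.2f} TB/s")
    # ~1.12 TB/s -- well above the ~0.2 TB/s reported for the kernel above.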