refibrillator 8 months ago
Note to others reading along: in the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8b, Qwen-2.5-14b/32b, and Mistral-small-24b models (the throughput penalty is not reported for the others). With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to CPU. Classic comp sci tradeoff between space and speed, no free lunch, etc.