Remix clone Hacker News


	▲	fheinsen 4 hours ago
		As the error of the linear approximation approaches the same magnitude as the numerical error of the quadratic computation, don’t the two become comparable in practice? I ask because, for inference, attention is typically computed in low precision (4-bit, 8-bit, or 16-bit floats). Numerical error may in fact be a key reason why quadratic attention exhibits context rot as the context gets longer, analogous to an RNN: https://www.anthropic.com/engineering/effective-context-engi...
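
		For anyone who wants to poke at the numerical-error side of this, here is a rough sketch, not a benchmark. It computes a single softmax-attention output twice over the same random scores and values: once in ordinary double precision, and once with every intermediate rounded to IEEE 754 half precision. The helper names (`to_half`, `attention`, `rounding_error`), the Gaussian scores, and the scalar values are all assumptions made for illustration; real kernels use fused ops, wider accumulators, and vector-valued heads, so the absolute numbers will differ.

```python
import math
import random
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def attention(scores, values, rnd=lambda x: x):
    """Softmax-weighted average of scalar `values`.

    `rnd` is applied to every intermediate result, so passing `to_half`
    simulates a computation carried out entirely in half precision.
    """
    scores = [rnd(s) for s in scores]
    m = max(scores)  # subtract the max so that exp() never exceeds 1
    num = 0.0
    den = 0.0
    for s, v in zip(scores, values):
        w = rnd(math.exp(rnd(s - m)))
        num = rnd(num + rnd(w * rnd(v)))
        den = rnd(den + w)
    return rnd(num / den)


def rounding_error(n: int, seed: int = 0) -> float:
    """Absolute gap between the double and simulated-half outputs for n keys."""
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]    # assumed score distribution
    values = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # assumed scalar values
    exact = attention(scores, values)
    half = attention(scores, values, to_half)
    return abs(exact - half)


if __name__ == "__main__":
    for n in (64, 512, 4096):
        print(n, rounding_error(n))
```

		Running it prints the gap for a few context lengths. Comparing those gaps against the approximation error of a given linear-attention scheme, on the same inputs, would be one way to check whether the two really are of similar magnitude.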