Remix clone Hacker News

	noosphr 2 hours ago
		Yes, and it works in theory. Less so in practice. With attention of order higher than 4, you saturate the memory of a B200 with a few dozen tokens. Training is even worse. To paraphrase Knuth: high-order polynomials are much more unimaginably large than mere infinity.
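
		For a rough sense of the scaling, here is a back-of-the-envelope sketch. It assumes (my assumption, not something stated above) that order-k attention naively materializes a score tensor with n^k entries per head, stored in 2-byte floats, and it takes 192 GB as a hypothetical memory budget. The function names are made up for illustration. Real usage also depends on heads, layers, batch size, and the activations kept around for the backward pass, so treat the numbers as a lower bound on the problem.

```python
def score_tensor_bytes(n_tokens: int, order: int,
                       bytes_per_entry: int = 2, heads: int = 1) -> int:
    """Bytes for a naively materialized order-k score tensor.

    Assumption: one entry per k-tuple of tokens, i.e. n**k entries per head.
    """
    return heads * bytes_per_entry * n_tokens ** order


def max_tokens(budget_bytes: int, order: int,
               bytes_per_entry: int = 2, heads: int = 1) -> int:
    """Largest token count whose score tensor still fits in the budget."""
    n = 0
    while score_tensor_bytes(n + 1, order, bytes_per_entry, heads) <= budget_bytes:
        n += 1
    return n


# Hypothetical budget: 192 GB, ignoring weights and all other activations.
BUDGET_BYTES = 192 * 10**9

if __name__ == "__main__":
    for order in range(2, 9):
        for heads in (1, 32):
            n = max_tokens(BUDGET_BYTES, order, heads=heads)
            print(f"order={order} heads={heads:>2} max_tokens={n}")
```

		The takeaway is the shape of the curve rather than any single number: each extra order multiplies the tensor by another factor of n, so the token count that fits shrinks very quickly, and multiplying by heads or keeping tensors for training shrinks it further.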