ctoa 2 hours ago
It's sort of an RNN, but it's also basically a transformer with shared layer weights: each step is equivalent to one transformer layer, so the computation for n steps is the same as that of an n-layer transformer. The context window applies to the sequence and isn't affected by the number of iterations; each iteration sees and attends over the whole sequence.
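To make the equivalence concrete, here's a minimal NumPy sketch (my own toy layer, not the architecture from the article): applying one weight-shared attention layer n times produces exactly the same result as stacking n layers that happen to share weights, and every position attends over the full sequence at every step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer(x, Wq, Wk, Wv):
    # One simplified transformer-style layer: every position
    # attends over the whole sequence, plus a residual connection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return x + attn @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
x = rng.standard_normal((5, d))  # a sequence of 5 tokens

# The "RNN" view: iterate the SAME layer n times.
n = 4
h = x
for _ in range(n):
    h = layer(h, Wq, Wk, Wv)

# The "n-layer transformer with tied weights" view.
h2 = x
for wq, wk, wv in [(Wq, Wk, Wv)] * n:
    h2 = layer(h2, wq, wk, wv)

assert np.allclose(h, h2)
```

The two loops are trivially identical here, which is the point: the recurrence is over depth, not over the sequence, so nothing about the context window changes as you iterate.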