ctoa 2 hours ago

It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers that all use the same weights.

The notion of a context window applies to the sequence dimension, not to the number of steps, so it doesn't really affect this: each iteration sees and attends over the whole sequence.
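To make the equivalence concrete, here's a rough sketch in plain Python. Everything in it is an assumption for illustration, not the actual model: a toy single-head layer with non-causal attention, no layer norm, and made-up names (`make_layer`, `layer`, `recurrent`, `stacked`). It just shows that looping one layer n times is the same computation as stacking n layers that share weights, and that every step attends over the full sequence.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(d, seed):
    """Hypothetical toy parameters: four d x d matrices with fixed-seed random values."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
    return {"wq": mat(), "wk": mat(), "wv": mat(), "w1": mat()}

def layer(x, p):
    """One simplified layer: self-attention over the whole sequence, then a ReLU MLP.
    Both sublayers are residual. Assumed for illustration: no mask, no layer norm."""
    d = len(x[0])
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    k_t = [list(col) for col in zip(*k)]
    # scores is T x T: every position scores every other position.
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    mixed = matmul(attn, v)
    h = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, mixed)]
    mlp = [[max(0.0, u) for u in row] for row in matmul(h, p["w1"])]
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h, mlp)]

def recurrent(x, p, n_steps):
    """RNN-style view: apply the same layer n_steps times."""
    for _ in range(n_steps):
        x = layer(x, p)
    return x

def stacked(x, layers):
    """Transformer-style view: run the input through a list of layers."""
    for p in layers:
        x = layer(x, p)
    return x

shared = make_layer(d=4, seed=0)
rng = random.Random(1)
seq = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 tokens, width 4

out_loop = recurrent(seq, shared, n_steps=3)
out_stack = stacked(seq, [shared, shared, shared])
print(out_loop == out_stack)
```

The number of steps only changes depth. The sequence length (5 here) is the part a context window would limit, and it stays the same at every step.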