ctoa 2 hours ago
It's sort of an RNN, but it's also basically a transformer with shared layer weights: each step is equivalent to one transformer layer, so the computation for n steps is the same as that of an n-layer transformer. The context window applies to the sequence and isn't affected by the number of iterations; each iteration sees and attends over the whole sequence.
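To make the equivalence concrete, here's a minimal NumPy sketch (my own toy layer, not the architecture from the article): applying one weight-shared attention layer n times produces exactly the same result as stacking n layers that happen to share weights, and every position attends over the full sequence at every step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer(x, Wq, Wk, Wv):
    # One simplified transformer-style layer: every position
    # attends over the whole sequence, plus a residual connection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    return x + attn @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
x = rng.standard_normal((5, d))  # a sequence of 5 tokens

# The "RNN" view: iterate the SAME layer n times.
n = 4
h = x
for _ in range(n):
    h = layer(h, Wq, Wk, Wv)

# The "n-layer transformer with tied weights" view.
h2 = x
for wq, wk, wv in [(Wq, Wk, Wv)] * n:
    h2 = layer(h2, wq, wk, wv)

assert np.allclose(h, h2)
```

The two loops are trivially identical here, which is the point: the recurrence is over depth, not over the sequence, so nothing about the context window changes as you iterate.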