Remix.run Logo
tripplyons 2 days ago

The output of the recurrence is still dependent on previous tokens, but it usually less expressive within the recurrence in order make parallelism possible. In MinGRU the main operation used to share information between tokens is addition (with a simple weighting).

You could imagine after one layer of the recurrence, the tokens already have some information about each other, so the input to the following layers is dependent on previous tokens, although the dependence is indirect compared to a traditional RNN.