GardenLetter27 2 hours ago

It's not just the architecture but also the data - the decoder-only approach lets you train in parallel over blocks of text (no serial waiting like with an RNN), which allows you to train on much, much more data.
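To make the "parallel over blocks" point concrete, here is a rough sketch in plain Python. The function names `causal_mask` and `training_pairs` and the toy block are made up for illustration. It shows that one block of text yields a next-token training example at every position, with a causal mask deciding what each position may see. A real decoder-only model would compute all of these rows in one batched matrix operation; the loop below only spells out what those rows contain.

```python
def causal_mask(T):
    # mask[i][j] is True when position i may look at position j (j <= i).
    return [[j <= i for j in range(T)] for i in range(T)]


def training_pairs(block):
    # A block of T+1 tokens gives T (context, target) pairs.
    # Inputs are the block without its last token; targets are shifted by one.
    inputs, targets = block[:-1], block[1:]
    mask = causal_mask(len(inputs))
    pairs = []
    for i, target in enumerate(targets):
        # The context for position i is every input token the mask allows.
        context = [tok for j, tok in enumerate(inputs) if mask[i][j]]
        pairs.append((context, target))
    return pairs


block = ["the", "cat", "sat", "down"]  # toy example block
pairs = training_pairs(block)
for context, target in pairs:
    print(context, "->", target)
```

Since the targets are known up front, no position has to wait for the previous one to be processed, whereas an RNN has to step through the hidden state one token at a time.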