zozbot234 3 days ago

AIUI, the thinking when developing transformers may have been that batch-level parallelism (training on text A and text B at the same time) just isn't enough for truly large-scale training. The remaining problem was to also parallelize the learning of very long-range dependencies within a single sequence, and transformers managed that: self-attention relates all token pairs in one shot, whereas an RNN has to step through the sequence token by token.
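
To make the contrast concrete, here's a minimal NumPy sketch (my own illustration, not from any particular paper): an RNN-style update has an unavoidable sequential loop over time steps, while a single-head self-attention pass computes every position's interaction with every other position as a couple of matrix multiplies, which parallelizes trivially.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
x = rng.standard_normal((seq_len, d))  # toy token embeddings

# RNN-style recurrence: each hidden state depends on the previous one,
# so the loop over time steps cannot be parallelized across the sequence.
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] + W @ h)

# Self-attention: every position attends to every other position in one
# batch of matrix multiplies; there is no sequential dependency.
# (Projections to queries/keys/values are omitted for brevity.)
scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ x                                 # (seq_len, d)

print(out.shape)  # each output row mixes information from all positions
```

Even a dependency between the first and last token is handled in that one matrix product, which is what lets the whole sequence be trained in parallel on hardware like GPUs.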