Remix.run Logo
augment_me 3 days ago

The trained from scratch models are similar because CNN's are local and impose a strong inductive bias. If you train a CNN for any task of recognizing things, you will find edge detection filters in the first layers for example. This can't happen for attention the same way because its a global association, so the paper failed to find this using SVD and just fine-tuned existing models instead.