Remix clone Hacker News

new | show | ask | jobs Github

	▲	augment_me 3 days ago
		The trained from scratch models are similar because CNN's are local and impose a strong inductive bias. If you train a CNN for any task of recognizing things, you will find edge detection filters in the first layers for example. This can't happen for attention the same way because its a global association, so the paper failed to find this using SVD and just fine-tuned existing models instead.