	▲	ethan_smith 3 days ago
		Attention weights still assign non-zero probability to irrelevant tokens: softmax output is strictly positive for every position, and the mechanism is optimized for prediction rather than semantic relevance. The weight that leaks onto irrelevant tokens mixes their value vectors into the output, which can interfere with the hidden-state representations.
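
		To make that concrete, here is a minimal sketch in plain Python. The scores and value vectors are made-up toy numbers, and `softmax` and `attend` are hypothetical helpers written for this example, not any library's API. Token 0 stands in for the relevant token, and tokens 1 to 3 are distractors whose values point in a different direction.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Return (weights, output), where output is the weighted sum of values."""
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Assumed toy inputs: one high-scoring token and three low-scoring distractors.
scores = [4.0, 0.0, 0.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

weights, out = attend(scores, values)

# Total attention mass that landed on the distractor tokens.
leaked = sum(weights[1:])

print("weights:", [round(w, 4) for w in weights])
print("leaked mass:", round(leaked, 4))
print("output:", [round(x, 4) for x in out])
```

		In this toy case roughly 5% of the attention mass lands on the distractors even though each one scores far below the relevant token, and that mass shows up directly in the second output coordinate. With more distractors at the same score the leaked share grows, since each one adds its own small positive weight.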