ozgung 5 hours ago
I think the biggest problem is that most tutorials use words to illustrate how the attention mechanism works. In reality, there are no word-associated tokens inside a Transformer. Tokens are not words or even word parts: after the embedding lookup, everything inside the blocks is just a vector. An LLM does not perform language processing inside its Transformer blocks, and a Vision Transformer does not perform image processing. Words and pixels are only relevant at the input boundary. I think this misunderstanding was a root cause of underestimating their capabilities.
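To make that concrete, here's a minimal NumPy sketch (toy vocabulary, random weights, everything hypothetical) of the point: once the embedding lookup happens, attention is just matrix math over anonymous float vectors, and the identical code would run unchanged on image-patch or audio embeddings.

  import numpy as np

  rng = np.random.default_rng(0)

  # Hypothetical sub-word vocabulary (BPE-style pieces, not whole words).
  vocab = ["un", "believ", "able", "!"]
  token_ids = np.array([0, 1, 2, 3])       # "unbelievable!" -> 4 token ids

  d_model = 8
  embedding_table = rng.normal(size=(len(vocab), d_model))

  # After this lookup, nothing downstream knows about words or pieces.
  x = embedding_table[token_ids]           # shape: (seq_len, d_model)

  # Single-head scaled dot-product attention over the float vectors.
  W_q = rng.normal(size=(d_model, d_model))
  W_k = rng.normal(size=(d_model, d_model))
  W_v = rng.normal(size=(d_model, d_model))

  Q, K, V = x @ W_q, x @ W_k, x @ W_v
  scores = Q @ K.T / np.sqrt(d_model)      # (seq_len, seq_len) similarities
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
  out = weights @ V

  print(out.shape)                         # (4, 8): vectors in, vectors out

Nothing in the attention step references `vocab` at all; swap the embedding table for patch embeddings and the block is doing "vision" with the same arithmetic.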