	woadwarrior01 3 days ago
		It's just a long-winded way of saying "tied embeddings" [1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings.
		[1]: https://arxiv.org/abs/1608.05859
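
For anyone who hasn't seen the idea spelled out, here is a rough sketch of what "tied" means: one weight matrix is used both to look up input token vectors and to project hidden states back to vocabulary logits. This is a toy in pure Python, not how any of the models above actually implement it; the class name `TiedEmbedding`, the helper `untied_num_params`, and the init scale of 0.02 are assumptions made up for illustration.

```python
import random


class TiedEmbedding:
    """Toy example: one matrix is both the input embedding and the output projection."""

    def __init__(self, vocab_size, d_model, seed=0):
        rng = random.Random(seed)
        # Single shared weight matrix of shape (vocab_size, d_model).
        # The 0.02 standard deviation is an arbitrary illustrative choice.
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(vocab_size)]

    def embed(self, token_id):
        # Input side: look up the row for this token.
        return self.weight[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of the same matrix.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weight]

    def num_params(self):
        return len(self.weight) * len(self.weight[0])


def untied_num_params(vocab_size, d_model):
    # Separate input and output matrices would double the embedding parameters.
    return 2 * vocab_size * d_model


model = TiedEmbedding(vocab_size=5, d_model=3)
scores = model.logits(model.embed(2))  # one score per vocabulary entry
```

Because there is only one matrix, the embedding parameters are counted once (`vocab_size * d_model`) instead of twice, which is where the savings come from on small models with large vocabularies.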