Remix clone Hacker News

	▲	jychang 11 hours ago
		What's not to believe? Qwerky-32B has already done something similar: it is a finetune of QwQ-32B that doesn't use a traditional attention architecture. And hybrid models aren't new; an MLA-based hybrid model is basically DeepSeek V3.2 in a nutshell. Note that DeepSeek V3.2 (and V3.1, R1, V3, and actually V2 as well) all use MLA. DeepSeek V3.2 is what adds the linear attention stuff. Actually, since DeepSeek V3.1 and DeepSeek V3.2 are just post-training on top of the original DeepSeek V3 pretraining run, I'd say this paper is basically doing exactly what DeepSeek V3.2 did in terms of efficiency.