| ▲ | syntaxing 5 days ago |
| The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we’ve had before and runs faster than a 14B model depending on how you offload between VRAM and CPU. That’s insane. |
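For anyone wanting to try the VRAM/CPU split being described: a minimal sketch using llama-cpp-python, where the model path, layer count, and thread count are placeholders to tune for your own GGUF file and hardware.

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# The model path is a placeholder; any GGUF quant of a Qwen MoE model works the same way.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen-moe-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=24,   # layers kept in VRAM; the rest run on the CPU
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads for the offloaded layers
)

out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising n_gpu_layers until VRAM is nearly full is usually where the speedup comes from; since an MoE only activates a few experts per token, even a partially offloaded model can end up faster than a dense 14B.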
|
| ▲ | moffkalast 5 days ago | parent | next [-] |
| In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence. |
| |
| ▲ | jychang 5 days ago | parent | next [-] | | Strong disagree. Llama 4's release in 2025 was (deservedly) panned, but Llama 3.1 405B does not deserve that slander. https://artificialanalysis.ai/#frontier-language-model-intel... Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405B was the very first open-source (open-weights) model to come close to the closed-source cutting edge. It was very close in performance to GPT-4o and Claude 3.5 Sonnet. In essence, it was DeepSeek R1 before DeepSeek R1. | | |
| ▲ | seunosewa 5 days ago | parent [-] | | He is definitely talking about Llama 4. | | |
| ▲ | lcnPylGDnU4H9OF 5 days ago | parent | next [-] | | > last year > dense > 405B model Llama 4 does not match any of these details. Maybe the commenter thinks their comment is about Llama 4 (I don't see a reason to believe so), but readers familiar with these details know they are referring to Llama 3.1. | | | |
| ▲ | jug 4 days ago | parent | prev [-] | | Llama 4 is neither from last year nor a dense model. |
|
| |
| ▲ | NitpickLawyer 5 days ago | parent | prev [-] | | It's not that clear. Yes, it underperforms in recent benchmarks and use cases (e.g. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on. Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind... | |
| ▲ | moffkalast 4 days ago | parent [-] | | Ok wow that is incredibly interesting, what a test. I would've honestly expected just random noise (like if you gave this same task to a human, lol) but you can even see related models produce similar results. Maybe it is an indicator of overall knowledge, or how consistent the world model is. It also might not correlate at all with non-geographical knowledge. |
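For context, my reading of the linked exercise is that it probes a model point by point with land-or-water questions and plots the answers as a map. A rough sketch of that kind of probe, where the `ask` callable stands in for whatever model client you'd actually use (the dummy below just answers "water"):

```python
# Rough sketch of a geography probe: ask a model, grid point by grid point,
# whether a coordinate is land or water, then render the answers as a crude
# ASCII map. `ask` is a stand-in for your own inference client.
from typing import Callable

def probe_world(ask: Callable[[str], str], lat_step: int = 10, lon_step: int = 10) -> str:
    rows = []
    for lat in range(80, -81, -lat_step):          # north to south
        row = []
        for lon in range(-180, 180, lon_step):     # west to east
            answer = ask(
                f"Is latitude {lat}, longitude {lon} land or water? "
                "Answer with one word: land or water."
            )
            row.append("#" if "land" in answer.lower() else "~")
        rows.append("".join(row))
    return "\n".join(rows)

if __name__ == "__main__":
    dummy = lambda prompt: "water"   # replace with a real model call
    print(probe_world(dummy))
```

Comparing the grids that different models produce is what makes related models drawing similar coastlines visible.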
|
|
|
| ▲ | halJordan 4 days ago | parent | prev [-] |
| Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4. Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI. |
| |
| ▲ | refulgentis 4 days ago | parent [-] | | I would venture that reading it as "Qwen made MoEs wholesale, or first, or better than anyone else" is reductive - it's merely that the expert counts and parameter numbers here are quite novel (70b...inferencing only 3b!?!) - I sometimes kick around the same take, but thought I'd stand up for this one. And I know what I'm talking about: I maintain a client that wraps llama.cpp plus ~20 models on inference APIs. |
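To make the total-versus-active distinction concrete, a back-of-the-envelope sketch; the shared size, expert count, expert size, and top-k below are illustrative placeholders, not the actual Qwen configuration.

```python
# Back-of-the-envelope sketch of why an MoE can be huge in total but cheap per token.
# The numbers are placeholders, not the real Qwen config.
def moe_param_counts(shared_b: float, n_experts: int, expert_b: float, top_k: int):
    total = shared_b + n_experts * expert_b   # parameters stored on disk / in memory
    active = shared_b + top_k * expert_b      # parameters actually used per token
    return total, active

total, active = moe_param_counts(shared_b=2.0, n_experts=64, expert_b=1.0, top_k=2)
print(f"total ~ {total:.0f}B, active per token ~ {active:.0f}B")
```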
|