I don't know what's so special about this paper.

- They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and also V3 and Deepseek R1 etc)

- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.

- They claimed to save a lot of money not doing a full pre-train run for millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6mil full pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - i don't know the exact price, but it's probably just a few hundred thousand dollars as well.

Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.

I don't really see anything new or interesting in this paper that isn't already something Deepseek V3.2 has already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new that's not in Deepseek V3.2?

▲

twotwotwo 3 hours ago | parent | next [-]

These are potentially complementary approaches. Various innovations have shrunk the KV cache size or (with DSA) how much work you have to do in each attention step. This paper is about hybrid models where some layers' state needs don't grow with context size at all.

SSMs have a fixed-size state space, so on their own they'll never going be able to recite a whole file of your code in a code-editing session for example. But if much of what an LLM is doing isn't long-distance recall, you might be able to get away with only giving some layers full recall capability, with other layers manipulating the info already retrieved (plus whatever's in their own more limited memory).

I think Kimi Linear Attention and Qwen3-next are both doing things a little like this: most layers' attention/memory doesn't grow with context size. Another approach, used in Google's small open Gemma models, is to give some layers only 'local' attention (most recent N tokens) and give a few 'full' (whole context window) attention. I guess we're seeing how those approaches play out and how different tricks can be cobbled together.

There can potentially be a moneyball aspect to good model architecture. Even if on its own using space-saving attention mechanisms in some layers of big models cost something in performance, their efficiency could allow you to 'spend' more elsewhere (more layers or more params or such) to end with overall better performance at a certain level of resources. Seems like it's good to have experiments with many different approaches going on.

▲

credit_guy 9 hours ago | parent | prev | next [-]

Here's what's important about this paper. It is written by AMD researchers. It shows AMD is investing in AI research. Is this the same level of achievement as DeepSeek 3.2. Most likely not. Do they have novel ideas? Difficult to say, there are hundreds of new ideas being tried in this space. Is this worthless? Most certainly not. In order to make progress in this domain (as in any other), you first need to get your feet wet. You need to play with the various components, and see how they fit together. The idea in this paper is that you can combine somehow SSMs (like Mamba) and LLMs (like LLama). The examples they give are absolute toys compared to DeepSeek 3.2 (the largest is 8 billion parameters, while DeepSeek 3.2 has 671 billion parameters). The comparison you are trying to make simply does not apply. The good news for all of us is that AMD is working in this space.

	▲	jychang 4 hours ago \| parent [-]
		Mamba based LLMs aren't even close to novel though. IBM's been doing this since forever [1]. Also, you're off on Deepseek V3.2's param count, the full model's 685B in size with the MTP layer. I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's not cutting edge when Deepseek or even IBM is running laps around them. [1] Here's a news article from April, although IBM has been doing it for a long time before that https://research.ibm.com/blog/bamba-ssm-transformer-model

▲

T-A 9 hours ago | parent | prev | next [-]

From your link: DeepSeek-V3.2 Release 2025/12/01

From Zebra-Llama's arXiv page: Submitted on 22 May 2025

	▲	jychang 4 hours ago \| parent [-]
		That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out. > Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025 I mean, good for them for shipping, I guess. But seriously, I expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2] so this isn't some exactly some obscure never-heard-of concept. [1] https://huggingface.co/ibm-granite/granite-4.0-tiny-base-pre... [2] https://youtu.be/Q5baLehv5So?t=6075

▲

SilverElfin 4 hours ago | parent | prev | next [-]

How did you get all this info about how each is trained? Is that something they admit now or is it through leaks?

	▲	jychang 3 hours ago \| parent [-]
		Deepseek? It's literally in their research papers. OpenAI? The OpenAI head of research @markchen90 straight up admitted it in a podcast. https://x.com/petergostev/status/1995744289079656834 "In the last 2 years we've put so much resourcing into, into reasoning and one byproduct of that is you lose a little bit of muscle on pre training and post training." "In the last six months, @merettm and I have done a lot of work to build that muscle back up." "With all the focus on RL, there's an alpha for us because we think there's so much room left in pre training." "As a result of these efforts, we've been training much stronger models. And that also gives us a lot of confidence carrying into Gemini 3 and other releases coming this end of the year." Note, "alpha" in the quote above is referring to https://en.wikipedia.org/wiki/Alpha_(finance) But it's pretty clear that the last full pretrain run they've released is for gpt-4o 2 years ago, and since then they've just been iterating RL for their models. You don't need any insider information to notice that, it's pretty obvious. Excluding GPT-4.5 of course, but even OpenAI probably wants us to forget about that.

▲

nickpsecurity 9 hours ago | parent | prev [-]

"Deepseek hasn't done a full $5.6mil full "

Don't forget the billion dollars or so of GPU's they had access to that they left out of that accounting. Also, the R&D cost of the Meta model they originally used. Then, they added $5.6 million on top of that.