| ▲ | jychang 5 days ago |
| Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix. Deepseek R1 also has an MTP layer (layer 61) https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod... But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] each, or about 2GB combined at FP8. Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob... So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference. |
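Back-of-the-envelope for that "about 2GB" figure (a sketch, assuming 1 byte per FP8 weight; tensor shape from the linked config):

    # two extra tensors of shape [129280, 7168], one byte per FP8 parameter
    vocab, hidden = 129280, 7168
    extra_tensors = 2                             # embed_tokens + shared_head.head
    gb = vocab * hidden * extra_tensors / 1e9     # bytes -> GB
    print(round(gb, 2))                           # ~1.85, i.e. "about 2GB"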
|
| ▲ | puilp0502 5 days ago | parent | next [-] |
| What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency? |
| |
| ▲ | jychang 5 days ago | parent | next [-] | | Speculative decoding! It makes inference a LOT faster. Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding on that second token (instead of having it be produced by a draft model like Qwen 0.6b). If the token is checked and is correct, then the 2nd token gets generated MUCH faster. If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster. | | |
| ▲ | stingraycharles 5 days ago | parent | next [-] | | Because then the second token only needs to be checked, not generated, as it’s already generated? And it’s much faster to generate multiple tokens at the same time than one at a time? Is that the idea? I’m not an expert on LLMs, just a user. | | |
| ▲ | tomp 4 days ago | parent | next [-] | | No, the parent is wrong. Checking a token is the same as generating it. The benefit however is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the “real” prediction for token 2. If the “real” prediction matches the MTP (Multi-Token Prediction) from the previous turn, you have just generated 3 correct tokens (and another speculative one). If not, you’ve now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again. | | |
| ▲ | bigwheels 4 days ago | parent | next [-] | | Thanks for the clarification. Your comment made me connect the similarity (in spirit) of Speculative Decoding to Speculative Execution [1] in CPUs. Very cool and clever optimization strategy for LLMs, IMHO. [1] https://en.wikipedia.org/wiki/Speculative_execution Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is. | |
| ▲ | jychang 2 days ago | parent | prev [-] | | To clarify, I should have stated: "Instead of generating tokens one at a time, you generate the second one as well WITH MTP, and then use speculative decoding on that second token (instead of having the second token be produced by a draft model like Qwen 0.6b). If the FIRST MTP token is checked and is correct, then the second token gets generated MUCH faster." |
| |
| ▲ | bdcs 4 days ago | parent | prev | next [-] | | It relies on an “unintuitive observation”[0] that you can run batches basically for free (up to a limit). So if you only run one inference, you batch it plus a lot of guesses and, if you guess right, can speed up the inference by the number of guesses. If you guess wrong, you're back to regular speed (and still fully correct). [0] https://x.com/karpathy/status/1697318534555336961 | |
| ▲ | namibj 5 days ago | parent | prev | next [-] | | Basically you can generate the next two tokens at once in the same matmul, and roll back to one-at-a-time when the generation shows you guessed wrong (since that means the second token of the pair was generated from revoked context). | |
| ▲ | Zacharias030 4 days ago | parent | prev [-] | | yes, if you know the sequence of tokens ahead of time you can verify them about as quickly as you can generate one more token because of the parallelism benefits. If you don’t know the future tokens though, then you can’t, and blind guessing of tokens is infeasible because the vocabulary contains circa 100k possible different tokens. |
| |
| ▲ | moffkalast 5 days ago | parent | prev [-] | | Hmm but isn't the checking only required because the draft model is not the same model and can only speculate what the main one is thinking, hence the name? If the main model generates two tokens itself, then how can it be wrong about its own predictions? | | |
| ▲ | jychang 5 days ago | parent | next [-] | | Because if you generate token n+1 with all 48 layers of Qwen3-Next and 80 billion params, and also generate token n+2 with the 1 MTP layer at 2bil params... that n+2 token can be much lower quality than the n+1 token but mostly correct. Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2bil parameters in size). Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way. | | |
| ▲ | littlestymaar 5 days ago | parent [-] | | > Then you just check if "Obama" is correct via the same speculative decoding process, which is a lot faster than if you had to start over from layer 1-48 and generate "Obama" the regular way. That doesn't match my understanding of what speculative decoding does: AFAIK with regular speculative decoding you ask a smaller llm to infer the next few tokens (let's say 5 tokens) and then you can have the big model infer tokens 1, 2, 3, 4, 5 and 6 in parallel (each time starting from the sentence partially completed by the smaller model). Because llms are bandwidth bound, doing the same work six times in parallel isn't slower than doing it only once (what's costly is moving the massive model weights between VRAM and the GPU cores). If tokens 1, 2 and 3 match what the small model inferred, then you keep them. As soon as you have a mismatched token (say token 4), it means you have to discard the subsequently inferred tokens (here tokens 5 and 6) because they were calculated under a wrong assumption for token 4. So if the MTP layer merely replaces the smaller llm in the previous scheme, with everything else working the same way, you wouldn't save anything when inferring “Obama” (you'd still need to “generate it the regular way”, as there isn't really another way) but you could also start working on the word immediately after “Obama” by assuming “Obama” was already chosen. And if the model actually outputted “Hussein” instead of “Obama”, then the token calculated to come after “Obama” would have to be discarded. Or maybe my understanding of speculative decoding is completely off… | |
| ▲ | vman512 4 days ago | parent [-] | | Sounds right. The policy for rejection can depend on what you want - you might accept the top K highest probability tokens or top P probability mass. Or you can do something like importance sampling and probabilistically reject based on the ratio of likelihoods |
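A toy sketch of the greedy version of this accept/reject step, in plain Python (stand-in code, not any particular inference engine's API):

    def accept_drafts(drafted, verified):
        """drafted:  tokens proposed by the draft/MTP head for positions n+1..n+k.
        verified: the main model's own picks for those positions, all obtained in
        one batched forward pass over the prompt plus the drafted tokens.
        Greedy acceptance: keep the longest matching prefix; at the first mismatch,
        keep the main model's correction and discard the rest. Samplers instead
        accept probabilistically, e.g. with probability min(1, p_main/p_draft)."""
        kept = []
        for d, v in zip(drafted, verified):
            if d != v:
                kept.append(v)   # the correction is still a valid next token
                break
            kept.append(d)       # match: this token came (almost) for free
        return kept

    # Draft guessed ["Barack", "Obama", "was"]; the main model agrees on the first
    # two but picks "," third -> we keep ["Barack", "Obama", ","].
    print(accept_drafts(["Barack", "Obama", "was"], ["Barack", "Obama", ","]))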
|
| |
| ▲ | SonOfLilit 5 days ago | parent | prev | next [-] | | If you ask me to guess an answer, I'll _usually_ produce the same answer as if I had time to think about it deeply, but not always... | | | |
| ▲ | EMM_386 5 days ago | parent | prev | next [-] | | I believe it's something along these lines. The MTP head runs simultaneously and generates a probability list based on what it thinks the results will be, learned during training.

    If n+1 = "Barack"    then n+2 = "Obama" (confidence: 0.90)
    If n+1 = "The"       then n+2 = "quick" (confidence: 0.45)
    If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)

A threshold is set (say, at 90%) so that if the n+2 prediction is above it (as in the first example), it is used without having to determine it with the main model. It's confident "enough". | | |
| ▲ | namibj 5 days ago | parent [-] | | Well yeah; also, inference benefits massively from batching, so you use the guesses to pre-fill the context needed to infer the next speculated tokens, and if a guess was wrong, you just have to recompute the speculated tokens that depended on that guessed context. You compute the next token and guess the one after; then you take the guess as if it were real and run inference for it together with the token after it, which is in turn speculated on the guess being correct. |
| |
| ▲ | eldenring 5 days ago | parent | prev [-] | | the 2nd token is generated without knowing what token was chosen for the 1st token |
|
| |
| ▲ | cubefox 5 days ago | parent | prev | next [-] | | > What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency? It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot... | | |
| ▲ | Zacharias030 4 days ago | parent [-] | | There is no reason that it couldn’t be beneficial for training though. | | |
| ▲ | cubefox 3 days ago | parent [-] | | Except that speculative decoding is de facto only an inference time optimization. But the H-Net architecture from the previous reference, which doesn't require tokens or speculative decoding, does something similar both for inference and training. | | |
| ▲ | Zacharias030 3 days ago | parent [-] | | Yes, but the discussion is about Multi-Token Prediction (Gloeckle et al. 2024) which is only incidentally useful for speculative decoding. |
|
|
| |
| ▲ | rfoo 5 days ago | parent | prev [-] | | It could be a better draft model than separately trained EAGLE etc for speculative decoding. |
|
|
| ▲ | Razengan 5 days ago | parent | prev | next [-] |
| Could someone kindly point to a convenient all-in-one ELI5 of all these words? :') |
| |
| ▲ | porridgeraisin 5 days ago | parent | next [-] | | Background: LLMs take your input, upscale it into a very high-dimensional space, and then downscale it back to a 1D list at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e. f(x) = downscale(upscale(x)). Each of downscale() and upscale() is parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where the bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard). --- When training, you make an LLM learn that

    I use arch = downscale(upscale(I use))

If you want to predict the next word after that, you do next in sequence:

    I use arch btw = downscale(upscale(I use arch))

Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way. Basically, you have a second downscale2() that learns how to predict the next-to-next word, i.e. in parallel:

    I use arch =     downscale1(upscale(I use))
    I use ____ btw = downscale2(upscale(I use))

However, this way you'll need twice the number of parameters downscale needs, and if you want to predict more tokens ahead you'll need even more. What Qwen has done is, instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common, and that the difference between predicting the next and the next-to-next token can be captured in one lightweight function each. Lightweight here means fewer parameters. The bet paid off. So overall, you save params. Concretely:

    Before: downscale1.params + downscale2.params
    After:  downscale_common.params + lightweight1.params + lightweight2.params

Edit: it's actually downscale_common(lightweight(.)) and not the other way around as I have written above. Doesn't change the crux of the answer, but just including this for clarity. | | |
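For the curious, here is a minimal PyTorch-style sketch of that factorization (names follow the comment above, the sizes are made up for illustration, and this is not the actual Qwen3-Next module layout):

    import torch.nn as nn

    class TwoTokenHead(nn.Module):
        # One big shared un-embedding ("downscale_common") plus a small
        # per-offset transform ("lightweight1"/"lightweight2").
        def __init__(self, hidden=2048, vocab=151936):
            super().__init__()
            self.downscale_common = nn.Linear(hidden, vocab, bias=False)  # expensive: hidden*vocab params
            self.lightweight1 = nn.Linear(hidden, hidden, bias=False)     # cheap: hidden*hidden params
            self.lightweight2 = nn.Linear(hidden, hidden, bias=False)

        def forward(self, h):  # h: [batch, hidden], output of the shared trunk ("upscale")
            logits_next = self.downscale_common(self.lightweight1(h))       # token n+1
            logits_next_next = self.downscale_common(self.lightweight2(h))  # token n+2 (draft)
            return logits_next, logits_next_next

    # Rough param comparison at hidden=2048, vocab=151936:
    #   two full heads:     2 * 2048 * 151936               ~ 622M
    #   shared + adapters:  2048 * 151936 + 2 * 2048 * 2048 ~ 319M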
| ▲ | pmarreck 4 days ago | parent | next [-] | | so after your edit it would be (just to clarify): I use ____ ___ = downscale_common(lightweight1(.)) + downscale_common(lightweight2(.)) ?
And does it generate 2 at a time and keep going that way, or is there some overlap? | | |
| ▲ | porridgeraisin 4 days ago | parent [-] | | You generate blocks of 2 at a time, yes. In general, k. As you can imagine, larger k performs worse. LLM(I like cats) is very likely to continue with "because they", but beyond that there are too many possibilities. LLM(I like cats because they are) = small and cute and they meow, while LLM(I like cats because they eat) = all the rats in my garden. If you try to predict the whole thing at once you might end up with "I like cats because they are all the rats and they garden". > Overlap Check out an inference method called self-speculative decoding, which solves (somewhat) the above problem of k-token prediction and which does overlap the same ___ across multiple computations. |
| |
| ▲ | Razengan 4 days ago | parent | prev | next [-] | | > I see you have a gamedev background Thanks for the tailored response! ^^ | |
| ▲ | losvedir 4 days ago | parent | prev | next [-] | | Ooooh, neat! That was very well explained, thank you. | |
| ▲ | fortyseven 5 days ago | parent | prev | next [-] | | Dude, this was like that woosh of cool air on your brain when an axe splits your head in half. That really brought a lot of stuff into focus. | |
| ▲ | JSR_FDED 5 days ago | parent | prev [-] | | Really good |
| |
| ▲ | lcnPylGDnU4H9OF 5 days ago | parent | prev | next [-] | | The best primer I've seen is Andrej Karpathy's first video in his "zero to hero" series. It's worth following along with your own practice. https://karpathy.ai/zero-to-hero.html | | | |
| ▲ | vessenes 5 days ago | parent | prev | next [-] | | Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/Chat/Gemini, and then as you read something like this, if you don't know the acronym (In this case: MTP = Multi Token Prediction), search it up, and see if you have the basis for understanding what it's about. If not read up on the precursors. Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available do not accurately describe what's going on enough to reason about and understand them. So, you just have to start reading! | | |
| ▲ | littlestymaar 4 days ago | parent [-] | | > Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. I don't think it moves this fast. I mean, there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's mostly incremental improvements that don't change the full picture much (a variation on the attention architecture and masking, a different activation function, different positional encoding, and swapping the MLP layers for a sparse “mixture of experts”). At the end of the day, from Mistral to Deepseek by way of llama and Qwen3, it's always the same stack of transformer layers with slight variations between any two architectures. This Qwen3-Next is special though, as it's the first time a major player has released something that different (lesser players have made hybrid-architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama4 to be. |
| |
| ▲ | wickedsight 5 days ago | parent | prev | next [-] | | For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification. | |
| ▲ | pmarreck 4 days ago | parent | prev [-] | | The following was generated by chatG5: Qwen3-Next — A family of large language models from Qwen (Alibaba).
DeepSeek R1 — Another large open-source language model from DeepSeek AI.
Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.
MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.
Embedding — Converts words/tokens into vectors (numbers) the model can work with.
Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.
embed_tokens — The big lookup table of embeddings (token → vector).
shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.
[129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).
FP8 — Floating-point format using 8 bits (compact, faster, less precise).
Active parameters — The weights that actually need to be loaded in GPU memory to run the model.
Inference — Running the model to generate text (as opposed to training it).
GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.
|
|
|
| ▲ | humblyCrazy 5 days ago | parent | prev [-] |
| How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding - meaning that if I use this model in vllm, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding? |