> Does this translate into a similar reduction in compute?

No, quite the opposite actually. As with speculative decoding, this model will compute more tokens and discard the invalid ones.

> What's the catch?

LLMs[1] are limited by memory bandwidth and not by compute[2]: because they process tokens one at a time, you spend more time moving the weights from VRAM to the GPU's compute units than waiting for the compute itself to happen. Techniques like these make it possible to process multiple tokens in parallel instead of one by one, and so make better use of your graphics card's compute. They do so by predicting which tokens are likely to occur and then verifying that the guess was correct.

For instance, suppose the previous token is “hello”.

A regular autoregressive LLM will compute:

“hello” => “! ”,

then “hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

and finally “hello! how are you” => “?<end>”

One at a time, loading every weight from GPU memory into its compute units 5 times.

With speculative decoding (I'd say this one isn't strictly speculative decoding, but it's a variant of the same principle), you have something that guesses that the whole sentence is going to be “how are you today?”, so the LLM can generate

“hello” => “! ”,

“hello! ” => “how ”,

“hello! how ” => “are ”,

“hello! how are ” => “you”.

“hello! how are you” => “?<end>”

“hello! how are you today” => “?<end>”

In parallel. So each weight would have been loaded from VRAM only once instead of 5 times.

The last token will be discarded though, as the prefix “hello! how are you today” doesn't match what has actually been generated (the model produced “?<end>” after “you”, not “today”). So in that particular example, you'd have gotten your 5 tokens 5 times faster than with pure autoregressive inference, but at the expense of a 6th token being computed and discarded immediately. So 5 times more token throughput, but a 20% increase in compute cost per token.
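A minimal sketch of that verification step in Python, to make the bookkeeping concrete. Everything here is a toy assumption: `target_next` is a hypothetical stand-in for the real model (it just replays the example sentence), and the list comprehension stands in for what a real implementation would do as a single batched forward pass.

```python
SCRIPT = ["hello", "! ", "how ", "are ", "you", "?<end>"]

def target_next(prefix):
    """Hypothetical stand-in for the LLM: the next token it would emit after `prefix`."""
    n = len(prefix)
    if n < len(SCRIPT) and prefix == SCRIPT[:n]:
        return SCRIPT[n]
    return "<other>"  # prefix diverged from what the model would have written

def verify(prompt, draft):
    """Check a guessed continuation; return (accepted tokens, positions computed)."""
    # One prediction per position: after the prompt, after prompt + 1 guess, ...
    prefixes = [prompt + draft[:i] for i in range(len(draft) + 1)]
    predictions = [target_next(p) for p in prefixes]  # a real model does this in one pass
    accepted = []
    for guess, predicted in zip(draft, predictions):
        accepted.append(predicted)  # the model's own token is always valid
        if guess != predicted:
            break  # everything computed after a wrong guess is discarded
    else:
        accepted.append(predictions[-1])  # whole draft was right: keep the bonus token
    return accepted, len(predictions)

accepted, computed = verify(["hello"], ["! ", "how ", "are ", "you", " today"])
print(accepted)                   # tokens kept
print(computed - len(accepted))   # tokens computed and thrown away
print(computed / len(accepted))   # compute per kept token, relative to one-at-a-time
```

With the guess from the example, 6 positions are computed in one pass, 5 tokens are kept and 1 is thrown away, which is where the 20% figure comes from.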

[1]: autoregressive LLMs, that is, which are the ones everybody uses because they perform best.

[2]: at least when run at low batch size, on your own computer for your personal use. In a datacenter, with many concurrent users, GPUs are actually compute-bound.
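To put rough numbers on [2], here is a back-of-the-envelope sketch. All figures are hypothetical round numbers, not any specific card or model, and it uses the common approximation of about 2 FLOPs per parameter per token while ignoring the KV cache and activations.

```python
# All hardware and model figures below are hypothetical round numbers.
PARAMS = 7e9            # model parameters
BYTES_PER_PARAM = 2     # 16-bit weights
BANDWIDTH = 1e12        # VRAM bandwidth in bytes/s
PEAK_FLOPS = 100e12     # peak compute in FLOP/s

def step_times(tokens_in_parallel):
    """Return (memory_seconds, compute_seconds) for one forward pass."""
    # The weights are read from VRAM once per pass, however many tokens share it.
    memory = PARAMS * BYTES_PER_PARAM / BANDWIDTH
    # Roughly 2 FLOPs (one multiply, one add) per parameter per token.
    compute = 2 * PARAMS * tokens_in_parallel / PEAK_FLOPS
    return memory, compute

def break_even_tokens():
    """Tokens per pass at which compute time catches up with memory time."""
    return BYTES_PER_PARAM * PEAK_FLOPS / (2 * BANDWIDTH)

for n in (1, 6, 100, 1000):
    memory, compute = step_times(n)
    bound = "memory-bound" if memory > compute else "compute-bound"
    print(f"{n:>5} tokens/pass: memory {memory * 1e3:.2f} ms, compute {compute * 1e3:.2f} ms, {bound}")
```

Under these assumptions, a single token per pass leaves the compute units idle most of the time, and it takes on the order of a hundred tokens per pass (from guessed tokens, concurrent users, or both) before compute becomes the bottleneck.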

kreelman an hour ago

Fantastic results. Well done. So this is built into the way the model works, if I'm understanding it correctly.

I was wondering what would be involved in getting it to work with GGUF files, rather than safetensors files...

	dot_treo an hour ago
		Just getting it into a GGUF file would be fairly trivial. But using that GGUF file would need a bunch of additional things. One would need to create a new architecture derived from Qwen3, and then probably adapt the speculative decoding functionality. At the moment not even MTP is merged into llama.cpp, so I wouldn't hold my breath for it.