zdw 6 hours ago

MTP support is being added to llama.cpp, at least for the Qwen models ( https://github.com/ggml-org/llama.cpp/pull/20533) and I'd imagine Gemma 4 will come soon.

The performance uplift on local/self-hosted models in both quality and speed has been amazing in the last few months.

tarruda 5 hours ago | parent | next [-]

There is a newer PR which will probably be merged soon: https://github.com/ggml-org/llama.cpp/pull/22673

entropicdrifter 5 hours ago | parent | next [-]

Ollama merged a PR for MTP about 2 hours ago, as well:

https://github.com/ollama/ollama/pull/15980

Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0

xlayn an hour ago | parent | prev | next [-]

Ohhhh geee!!! I just applied the patch to my local git copy. You need to use the model on the PR that he submitted; the model is particular because it has extra information that allows the MTP to happen. I have two AMD GPUs, and Qwen3.6 27B qk6 does around 20 t/s generation... If I run it on only one, I get around 35 t/s.

But with this patch I saw 46 t/s with Qwen3.6 27B q8... this is insane, it's more than twice the original speed. There was no GPU I could upgrade to for that kind of boost. Amazing!

nzeid 3 hours ago | parent | prev | next [-]

A few days ago I switched again from Qwen3.6 to Gemma 4 - for personal use I've experienced better average performance with the 26B version of the latter than the 27B of the former.

For someone who's been running local models for a long while, these are very very exciting times.

apexalpha 3 hours ago | parent [-]

I’ve been swapping between these two as well.

However, I find Qwen unbeatable for tool calling. I think Gemma wasn't trained on that at all.

nzeid 2 hours ago | parent | next [-]

I'm using llama.cpp with Gemma and tool calling is mission critical. It's perfectly fine on my end.

There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.

sigmoid10 2 hours ago | parent | prev [-]

Gemma certainly was trained for tool calling, but the implementation in llama.cpp has been troubled because Gemma uses a different chat template format. The processor from the transformers library works fine though.
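The mismatch is easy to see side by side. A minimal sketch, assuming the published Gemma prompt format versus the ChatML format many parsers expect (these render functions are made up for illustration; the real templates are the Jinja ones shipped with each model):

```python
# Sketch: why a parser written for one chat-template format breaks on another.
# ChatML-style models wrap turns in <|im_start|>/<|im_end|>; Gemma wraps them
# in <start_of_turn>/<end_of_turn>, renames "assistant" to "model", and has
# no separate system role.

def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

def render_gemma(messages):
    role_map = {"assistant": "model", "user": "user"}
    return "".join(
        f"<start_of_turn>{role_map.get(m['role'], 'user')}\n"
        f"{m['content']}<end_of_turn>\n"
        for m in messages
    )

msgs = [{"role": "user", "content": "call the weather tool for Paris"}]
print(render_chatml(msgs))
print(render_gemma(msgs))
```

Tool-call markup rides on top of whichever turn format the model was trained on, so a server that hard-codes one format will mangle the other.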

fridder 3 hours ago | parent | prev | next [-]

I'd love to see this in oMLX too. It has been a rather nice tool

basch 5 hours ago | parent | prev | next [-]

I have a dumb performance question.

Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?

XYen0n 4 hours ago | parent | next [-]

The only thing a model can output is tokens; to achieve this, you need a tool that converts those tokens into operational transformations. For example, I have an ast-grep skill that instructs the model to generate ast-grep rules and then runs ast-grep to perform the file modifications.

basch 3 hours ago | parent [-]

I am saying to directly output the operational transformation instructions as the tokens. You’re essentially telling it to “write the diff” and then applying the patch.

[retain(8), delete(6), insert("very very"), retain(10)]

sigmoid10 3 hours ago | parent | prev | next [-]

The simple answer is: because it is not necessary to achieve the same final output. Most LLMs today are trained as autoregressive token predictors. They fundamentally can't work any other way. But we know how to train them really well and they have many applications beyond editing text. Diffusion LLMs exist too, which work a bit closer to what you describe, but they are not yet at the same level of intelligence since training methods are not that mature and they are generally less flexible as well.

basch 3 hours ago | parent [-]

So predict the tokens of the operational transformation.

I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”

and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].

In a large paragraph of text, I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than to re-predict the entire remainder of the unedited text.
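The "apply" side of this really is trivial; the hard part is getting the model to emit correct offsets. A minimal sketch of an applier for the op encoding used in the comment above (the tuple representation is made up for illustration):

```python
# Apply an operational-transformation op list to a string.
# Ops: ("retain", n) copies n chars, ("delete", n) skips n chars,
# ("insert", s) emits s without advancing the cursor.

def apply_ot(text, ops):
    out, pos = [], 0
    for op, arg in ops:
        if op == "retain":
            out.append(text[pos:pos + arg])
            pos += arg
        elif op == "delete":
            pos += arg
        elif op == "insert":
            out.append(arg)
    out.append(text[pos:])  # keep any trailing text the ops didn't cover
    return "".join(out)

ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
print(apply_ot("this is really beautiful", ops))  # this is very very beautiful
```

Note the fragility this exposes: an off-by-one in any `retain` count corrupts everything after it, which is one reason tools tend to prefer search/replace-style diffs over raw offsets.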

sigmoid10 2 hours ago | parent [-]

Sounds easy, but isn't in practice. You can look at the edit-text-file tool in VS Code Copilot, for example, to see how complicated that can get: https://github.com/microsoft/vscode-copilot-chat/tree/9e668c...

basch 2 hours ago | parent [-]

I have no idea when I’m being lied to anymore, but allegedly Aider and Cursor work the way I described, although Cursor uses a second model to apply the edit.

jfim 2 hours ago | parent | prev | next [-]

I've seen Claude use sed to edit files on other hosts instead of copying the file back and forth to edit it. Not quite full-blown OT, but it's going in that direction.

cryptoz 4 hours ago | parent | prev [-]

This is the approach I take with code edits to existing files at Code+=AI; I wrote a blog post with a simple example of AST modification to illustrate: https://codeplusequalsai.com/static/blog/prompting_llms_to_m...

EGreg 6 hours ago | parent | prev | next [-]

How does this get added in practice?

flakiness 6 hours ago | parent [-]

According to the linked PR, the original model already ships with MTP, which is another "head" (= output path) in the same model that (supposedly) runs faster.

The current implementation ignores that head, but the PR lets the tool recognize it and does the proper integration (run the MTP head alongside the slower main path, then compare the results, I believe).
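In structural terms the extra head really is just extra weights hanging off the shared trunk. A toy sketch, with made-up shapes and names (a real MTP head is typically a small transformer block plus projection, not a single matrix):

```python
import numpy as np

# Toy sketch: an MTP head as a second output projection sharing the trunk.
rng = np.random.default_rng(0)
d_model, vocab = 16, 32

trunk_out = rng.normal(size=(d_model,))      # hidden state from the shared trunk
w_main = rng.normal(size=(d_model, vocab))   # main head: predicts token t+1
w_mtp = rng.normal(size=(d_model, vocab))    # extra head: predicts token t+2

next_token = int(np.argmax(trunk_out @ w_main))
draft_token = int(np.argmax(trunk_out @ w_mtp))  # cheap draft for speculation
print(next_token, draft_token)
```

The point is that both heads read the same trunk activation, so the draft token comes almost for free once the main forward pass has run.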

flebron 3 hours ago | parent [-]

The standard way of doing MTP is to run the drafter autoregressively for k steps, and then (not concurrently) use the larger model as a verifier for those k tokens at the same time. The larger model can then accept a prefix of those k tokens, and in any case generates one more token (which is needed in case you accepted zero tokens from the drafter). The larger model can effectively use this k as a "batch" dimension, reducing the penalty of large weight loading. Meanwhile the drafter is much smaller, so it's fine for _it_ to be autoregressive, as long as the main model is parallel.
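The accept/verify logic described above can be sketched in a few lines. This is a toy with stand-in callables, not a real LLM, and a real implementation verifies all k positions in one batched forward pass rather than k sequential calls:

```python
# Toy sketch of draft-then-verify speculative decoding.
# draft_model / verify_model are stand-ins: each maps a token sequence
# to the single next token it would predict.

def speculative_step(prefix, draft_model, verify_model, k):
    # 1. Drafter runs autoregressively for k steps.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Verifier checks the k draft positions (batched in real systems)
    #    and accepts the longest agreeing prefix.
    accepted, ctx = [], list(prefix)
    for t in draft:
        if verify_model(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break  # first disagreement: stop accepting
    # 3. Verifier always contributes one more token
    #    (covers the case where zero draft tokens were accepted).
    accepted.append(verify_model(ctx))
    return accepted

# Example: both toy models spell out a target string; the drafter
# guesses wrong once the context reaches length 6.
target = "speculative"
big = lambda ctx: target[len(ctx)] if len(ctx) < len(target) else "."
small = lambda ctx: "X" if len(ctx) == 6 else big(ctx)
print(speculative_step(list("spec"), small, big, k=4))  # ['u', 'l', 'a']
```

One step here emits three tokens for the cost of one "large-model" batch, which is where the speedup comes from when the drafter's acceptance rate is high.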

dakolli 6 hours ago | parent | prev | next [-]

yet, still mostly useless.

WhitneyLand 6 hours ago | parent | prev [-]

Yeah, it's important to remember that conceptually MTP is, in a sense, just more weights; speculative decoding is the runtime algorithm, and that is a significant addition to whatever code is serving the model.

HumanOstrich 5 hours ago | parent [-]

That is... inaccurate.

WhitneyLand 2 hours ago | parent [-]

How so? I’m not saying most of the work doesn’t go into creating the drafting model or enabling a new head on the primary model, but the point is that however cool it is, the result is more weights. Speculative decoding requires code that is aware of how this works at the inference level.