I haven't tried this model yet, but I can run Gemma 31B w/ the MTP drafter on pure CPU at about 10 tok/s, so this should run at about 20-30 tok/s on a decent CPU. It'll probably run at >50 tok/s on any Mac that can fit it, and lots of people have a gaming GPU with enough VRAM. If access to hardware is a gate, it's one you can hop pretty easily.

dofm 10 days ago | parent [-]

Could you outline how you are running the MTP drafters? I've tried LM Studio but no dice there. I'm probably missing something but I think llama.cpp and Ollama can't do it yet either?

thot_experiment 9 days ago | parent | next [-]

I just build llama.cpp from source, from the PR that adds MTP drafter support.

https://github.com/ggml-org/llama.cpp/pull/23398
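
For anyone following along, here is a minimal sketch of that workflow: fetch the PR head from GitHub, check it out, and run a standard CMake build. It assumes `git` and `cmake` are installed. The local branch name `mtp-pr`, the `llama.cpp` checkout directory, and the helper names `build_commands` and `run` are made up for illustration. The thread doesn't say which runtime flags enable the drafter, so the sketch stops at the build.

```python
import subprocess

REPO_URL = "https://github.com/ggml-org/llama.cpp"
PR_NUMBER = 23398  # taken from the PR link above


def build_commands(pr_number, src_dir="llama.cpp", branch="mtp-pr"):
    """Return the git and CMake commands, as argv lists, in the order they run.

    `src_dir` and `branch` are hypothetical local names, not project conventions.
    """
    build_dir = f"{src_dir}/build"
    return [
        ["git", "clone", REPO_URL, src_dir],
        # GitHub exposes each pull request under refs/pull/<number>/head.
        ["git", "-C", src_dir, "fetch", "origin", f"pull/{pr_number}/head:{branch}"],
        ["git", "-C", src_dir, "checkout", branch],
        # Plain CPU build; GPU backends need extra CMake options not shown here.
        ["cmake", "-S", src_dir, "-B", build_dir, "-DCMAKE_BUILD_TYPE=Release"],
        ["cmake", "--build", build_dir, "--config", "Release", "--parallel"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default: prints the commands without touching the network.
    run(build_commands(PR_NUMBER), dry_run=True)
```

Calling `run(build_commands(PR_NUMBER), dry_run=False)` would actually clone and build. The dry run is the default so the commands can be read, or pasted into a shell, first.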

Please don't use Ollama, it's a bad actor in the OSS community.

dofm 9 days ago | parent [-]

I don't have the energy to build stuff all the time, that's a rabbit-hole side tunnel I don't really want to get into. I have larger concerns in my life that are more urgent than developing that side of things.

I've moved on from Ollama for the time being, but I am mainly interested to see what the Gemma 4 MTP speeds are like on my M1 Max, so I may test it.

I am quite impressed with the tools in LM Studio, which is also a beautiful app, but it is not open source (which challenges my personal strategy somewhat) and I dread its inevitable enshittification.

Nevertheless the GUI has been very helpful while I learn, and I will probably use it until something else presents or my usage pattern settles down from experimentation to something a bit more routine.

I will try oMLX, too, but judging by the LiteRT page I may soon be able to just use that for the larger models if I end up settling on Gemma 4.

thot_experiment 9 days ago | parent [-]

Totally understandable. YMMV but I found the llama.cpp build process to work on the first try on my machine, and it only takes a couple minutes, which definitely isn't my usual expectation or experience. I was very pleasantly surprised. Their web-ui is also getting very polished while still doing a great job of letting you tweak all the weird settings.

dofm 9 days ago | parent | next [-]

Sorry, I sounded a bit terse there! You have probably convinced me to give it a try, to be honest. It's just that, to cut a long story short, I am currently recovering from a level of burnout so severe that twelve months ago it had me fully convinced I was actually in early-onset cognitive decline (I am a bit over fifty). Only a little over two months ago I was still sure I'd have to quit IT and find a slow job because I was so out of the loop; this whole industry shift even in just the last few months is so shocking and strange. So I have to be a bit cautious about how many indirections I add, if that makes sense. But I am compiling bigger projects than llama.cpp so I will give it a go. Thank you for the extra detail.

dofm 5 days ago | parent | prev [-]

I never did quite get round to a llama.cpp build, but support is there now anyway and I have the MTP drafter working with the QAT 26B build: https://news.ycombinator.com/item?id=48441450

Patrick_Devine 10 days ago | parent | prev | next [-]

I haven't yet pushed the MTP-enabled gemma4 12b model for Ollama because in my testing I wasn't getting a performance bump. The other gemma4 MTP models should work OK right now, but there are some fixes we're just about to push. This is specifically for the MLX backend.

dofm 9 days ago | parent [-]

Thanks for your reply. I will go back and look at Ollama again. So much to learn, but this news has really vindicated my decision to direct my limited span of concentration and focus to learning how to use open weights models and opencode.

ch_sm 10 days ago | parent | prev [-]

Can't speak to compatibility with this new model, but oMLX supports MTP drafters very well.

dofm 10 days ago | parent [-]

Thank you, I will test that.