regexorcist 4 hours ago

Sounds like a game changer if I see that kind of speedup on my hardware. So far I've preferred Qwen 3.6 because of its better tool handling, even though Gemma 4 is faster, but I saw they've updated the model template and that's supposed to be better now. Looking forward to trying this with llama.cpp.

ch_sm 3 hours ago | parent [-]

Gemma 4 has a specific problem with tool calls that affects most runtimes. Fixes for Ollama and vLLM are being worked on right now.

adrian_b 2 hours ago | parent | next [-]

The chat templates of all Gemma 4 models were updated 7 days ago to fix some bugs related to invoking tools.

So any tests done with models that haven't been updated in the last few days are no longer relevant; they should be repeated after updating the models and regenerating any derived file formats, such as GGUF files.
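For context on why a template update fixes tool calling across runtimes: a chat template is just a Jinja2 template that turns the message history and tool definitions into the raw prompt string the model sees, so a bug in it breaks tool calls everywhere the template is used. Below is a minimal toy sketch of that idea; the template and field names are invented for illustration and are nothing like Gemma's actual template.

```python
# Toy illustration of a chat template: a Jinja2 template that renders
# messages plus tool definitions into the prompt string a runtime feeds
# the model. NOT Gemma's real template -- purely hypothetical markers.
from jinja2 import Template

toy_template = Template(
    "{% for m in messages %}<start>{{ m.role }}\n{{ m.content }}<end>\n{% endfor %}"
    "{% if tools %}Available tools: "
    "{% for t in tools %}{{ t.name }}({{ t.params | join(', ') }}){% endfor %}"
    "{% endif %}"
)

prompt = toy_template.render(
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=[{"name": "get_weather", "params": ["city"]}],
)
print(prompt)
```

If the `{% if tools %}` section renders the tool signatures in a format the model wasn't trained on, every runtime that consumes the template (llama.cpp, Ollama, vLLM) inherits the same broken tool calling, which is why the fix lands in the template rather than in each runtime.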

apexalpha 3 hours ago | parent | prev [-]

I read somewhere you need to drop the temperature to 0.1 on Gemma for tool calls.

Not sure why (too amateur sorry).

Though I think Qwen was natively trained on tool calling.
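The likely reason low temperature helps: temperature rescales the logits before the softmax, so a low value like 0.1 makes the most likely next token dominate, which matters when the model has to emit exact structured syntax (e.g. a JSON tool call) rather than free prose. A small self-contained sketch with made-up logits, not taken from any real model:

```python
# Why lowering temperature helps structured output like tool calls:
# dividing logits by a small temperature sharpens the softmax, so the
# top token is picked far more reliably. Logit values are illustrative.
import math

def softmax(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next tokens, e.g. '{' vs. '"' vs. 'Sure'
logits = [2.0, 1.5, 0.5]

p_default = softmax(logits, 1.0)
p_low = softmax(logits, 0.1)

print(f"T=1.0: top token p={p_default[0]:.2f}")  # → T=1.0: top token p=0.55
print(f"T=0.1: top token p={p_low[0]:.2f}")      # → T=0.1: top token p=0.99
```

At T=1.0 the runner-up tokens still get meaningful probability mass, so sampling can derail a tool-call mid-JSON; at T=0.1 the distribution collapses onto the top token and the structured output stays well-formed.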