| ▲ | v3ss0n 2 days ago |
| Qwen 3.6 burns it to the ground. It was not even a challenge. Gemma 4 seriously fails at tool calls and agentic work. It got all messed up after 2-3 turns of vibecoding. |
|
| ▲ | xrd 2 days ago | parent | next [-] |
| How do you run it? vllm? llama.cpp? Can you share the parameters you use to enable tool calling and agentic usage? Or, at a higher level, your philosophy on what approaches to use when tuning for better tool calling and/or agentic usage? I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love the unsloth guys) on my RTX3090/24GB using opencode as the orchestrator. It concocts some misleading paths, but the code often compiles, and I consider that a victory. You have to watch it like you would watch a 14-year-old boy who says he is doing his homework but you hear the sound effects of explosions. |
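For reference, my launch is roughly along these lines (a minimal sketch from memory; the context size and GPU-layer count are just what I happen to use on 24GB, adjust for your setup):

    # llama.cpp's OpenAI-compatible server; --jinja enables the chat template needed for tool calls
    llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
      --jinja \
      --ctx-size 32768 \
      --n-gpu-layers 99 \
      --host 0.0.0.0 --port 8080

opencode can then be pointed at http://localhost:8080/v1 as an OpenAI-compatible provider.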
| |
|
| ▲ | thot_experiment 2 days ago | parent | prev | next [-] |
| Naw, I mean I prefer Qwen 3.6 to Gemma 90% of the time, especially the MoE with a light tune to make its tone more Claude-like, but Gemma 4 is definitely better in some cases and I think they're pretty close in general. The difference basically boils down to Gemma 4 making more assumptions and Qwen 3.6 sticking closer to the prompt: if your prompt is bad or leaves things up to the imagination, Gemma will do a better job; if you need strict prompt adherence, Qwen is better. Since local models are "dumb", I think it makes sense to prefer prompt adherence, but there are complex tasks that Gemma will complete much, much faster than Qwen because it makes the right assumptions the first time, and as a result, even with slower inference, it requires way fewer turns. My speculation is that this comes from Google having a much better strategy for filtering their training data, and I think it also shows up in the shape of the models' world knowledge. Gemma's world knowledge seems deeper even though the models are of roughly equivalent size to their Qwen counterparts, so it's most likely just concentrated in places that are more relevant to my queries. Most notably, in my testing Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense. |
|
| ▲ | 59nadir 2 days ago | parent | prev | next [-] |
| Counterpoint: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B quantized to 8 bits does remarkably better at actually figuring out how to get text into buffers than Qwen3.6-35B-A3B, a model in a similar class. Now, is this the usual use case? No, it's a benchmark I created specifically to put LLMs in situations where they can't just blast out their usual bash commands and instead have to interface with something else and adapt. |
| |
| ▲ | celrod 2 days ago | parent [-] | | Fellow kakoune user here. I'm curious about your use case / what you're doing with it! | | |
| ▲ | 59nadir 2 days ago | parent [-] | | I'm just messing around with building agents, that's all. I'm not super interested in making ones that just sit in a terminal executing shell scripts because, truth be told, they're absolutely trivial to make and don't show any interesting parts of LLMs, whereas telling an agent that it is sitting in Kakoune is a whole lot more interesting and really shows a lot of what LLMs aren't great at, and how they have to fight their urge to spit out overwrought bash invocations or at the very least find a way to fit those into something new. So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes it have to work to get anything done. I could make things super easy for it, but then it wouldn't be an interesting experiment. | | |
| ▲ | 59nadir a day ago | parent [-] | | As an addendum to this: if I were to try to make something more useful out of it, I'd probably add the ability for LLMs to list buffers, give them an easier out for executing shell scripts in the way they prefer, give them an easier way to list docs, and a few other things like that. The tools and the interaction with Kakoune are really trivial to write; I already do this by having the agent write to the session FIFO (a very simple binary format), and I extract information via my own FIFO that Kakoune writes to (this is used for the buffer data only right now). I think once you started using it more as a tool and not as the pseudo-benchmark I'm treating it as, you'd probably think of even more things to add, but a lot of it comes down to just making Kakoune's state visible and making shell spam (which the LLMs love) easier. |
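For the curious, stripped of the agent wrapper, roughly the same round trip can be reproduced from a shell with kak -p, which writes to the same session socket the agent pokes directly (a rough sketch; the session name and dump path are made up for illustration):

    # send editing commands into a running session named "agent" (hypothetical name)
    printf '%s\n' \
      'edit -scratch *agent*' \
      'execute-keys -buffer *agent* "ihello from the agent<esc>"' \
      | kak -p agent

    # pull buffer contents back out by having Kakoune dump the buffer to a file
    echo 'evaluate-commands -buffer *agent* %{ write /tmp/agent-dump.txt }' | kak -p agent
    cat /tmp/agent-dump.txt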
|
|
|
|
| ▲ | 2ndorderthought 2 days ago | parent | prev | next [-] |
| Gemma 4 is definitely not meant for vibe/agentic coding. Not even worth trying. But it's a different weight class. |
|
| ▲ | lambda 2 days ago | parent | prev | next [-] |
| Gemma 4 31b was working OK for me, but it was consuming tons of memory on SWA checkpoints (I had to turn them way down), and as a 31b dense model it is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though. The Qwen models, on the other hand, are quite solid. |
| |
| ▲ | xrd 2 days ago | parent [-] | | What are you using to run it: vllm, llama.cpp, or something else? Can you share your switches and your approach for using tools? | | |
| ▲ | lambda 2 days ago | parent [-] | | llama.cpp. My setup is a bit of a mess as I experiment with different ways of configuring and hosting local models. At some point I was experimenting with the router server but stopped doing that, so some of my settings are still in models.ini while some are on the command line.

    podman run --env "HF_TOKEN=$HF_TOKEN" --env "LLAMA_SERVER_SLOTS_DEBUG=1" \
      -p 8080:8080 --device /dev/kfd --device /dev/dri \
      --security-opt seccomp=unconfined --security-opt label=disable \
      --rm -it \
      -v ~/.cache/huggingface/:/root/.cache/huggingface/ \
      -v ./unsloth:/app/unsloth -v ./models.ini:/app/models.ini \
      llama.cpp-rocm7.2 \
      -hf unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL \
      --chat-template-file /root/.cache/huggingface/gemma-4-31B-it-chat_template.jinja \
      -ctxcp 8 --port 8080 --host 0.0.0.0 -dio --models-preset models.ini

These are the relevant settings in models.ini (I actually have no idea if these settings are applied when not using the router server; it's been hard for me to figure out what settings are actually applied when using both the command line and models.ini) [*]
    jinja = true
    seed = 3407
    flash-attn = on

    [unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL]
    temperature = 1.0
    top_p = 0.95
    top_k = 64
And it looks like the chat_template.jinja I have is actually out of date by now; there was a new one pushed just a couple of days ago that seems to have some further tool calling fixes: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

As my harness, I'm using pi, with a pretty vanilla config. Anyhow, Gemma 4 31b worked in this config, but it was slow and RAM hungry. Since then, I've mostly moved to Qwen 3.6 35b-a3b because it's a lot faster. I'm not actually doing anything useful with these yet, but in my experiments Qwen 3.6 35b-a3b was capable of running some pretty long, mostly unsupervised agentic loops.
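For what it's worth, the Qwen setup is much simpler; outside the container it boils down to something like this (a rough sketch, and the exact repo/quant tag here is a guess rather than the one I actually pulled):

    llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M \
      --jinja \
      --ctx-size 32768 \
      --host 0.0.0.0 --port 8080

The same -hf swap would of course also work inside the podman invocation above.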
|
|
|
| ▲ | blurbleblurble 2 days ago | parent | prev | next [-] |
| I agree, but would add that Gemma 4 is really nice at vibing in ways Qwen 3.6 could never be. Maybe it could be fun to hook them up via the A2A protocol as left and right brain agents operating in tandem. |
| |
| ▲ | copper-float a day ago | parent [-] | | As someone who has never used AI for any coding or agent tasks, I feel like I'm going insane when I read things like this:

> Maybe it could be fun to hook them up via the A2A protocol as left and right brain agents operating in tandem

What in the world does this even mean? |
|
|
| ▲ | BoredomIsFun 21 hours ago | parent | prev [-] |
| > Qwen 3.6 burns it to the ground.

Not for creative writing or NLP. |