vessenes 2 days ago

Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.

v3ss0n 2 days ago | parent | next [-]

Qwen 3.6 burns it to the ground. It was not even a challenge. Gemma 4 seriously fails at tool calls and agentic work. It got all messed up after 2-3 turns of vibecoding.

xrd 2 days ago | parent | next [-]

How do you run it? vllm? llama.cpp?

Can you share the parameters you use to enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.

jyap 2 days ago | parent [-]

I run it with Llama.cpp on my RTX 3090. Also using the same Unsloth model.

My config is similar to: https://github.com/noonghunna/club-3090/blob/master/docs/eng...

I need to try out some of the other setups mentioned in this repo for increased TPS.

thot_experiment 2 days ago | parent | prev | next [-]

naw, i mean i prefer Qwen 3.6 to Gemma 90% of the time, especially the MoE with a light tune to make its tone more claude-like, but Gemma 4 is definitely better in some cases and I think they're pretty close in general.

The difference basically boils down to Gemma 4 making more assumptions and Qwen 3.6 sticking closer to the prompt. If your prompt is bad or leaves things up to the imagination, Gemma will do a better job; if you need strict prompt adherence, Qwen is better. Since local models are "dumb" I think it makes sense to prefer prompt adherence, but there are complex tasks that Gemma will complete much, much faster than Qwen because it makes the right assumptions the first time and, as a result, requires way fewer turns even with slower inference.

My speculation is that this comes from Google having a much better strategy for filtering their training data. I think this also shows up in the shape of the models' world knowledge: Gemma's world knowledge seems deeper even though the models are roughly the same size as their Qwen counterparts, so it's most likely just concentrated in places that are more relevant to my queries.

Most notably in my testing, Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense.

59nadir 2 days ago | parent | prev | next [-]

Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B (8-bit quantized) does remarkably better at actually getting text into buffers than Qwen3.6-35B-A3B, which is in a similar class to Gemma4 A4B.

Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.

celrod 2 days ago | parent [-]

Fellow kakoune user here. I'm curious about your use case/ what you're doing with it!

59nadir 2 days ago | parent [-]

I'm just messing around with building agents, that's all. I'm not super interested in making ones that just sit in a terminal executing shell scripts, because truth be told they're absolutely trivial to make and don't show any interesting parts of LLMs. Telling an agent that it is sitting in Kakoune is a whole lot more interesting, and really shows a lot of what LLMs aren't great at: they have to fight their urge to spit out overwrought bash invocations, or at the very least find a way to fit those into something new.

So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes them have to work for doing things. I could make it super easy for them but then it wouldn't be an interesting experiment.
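For the curious, the two tools roughly look like this as OpenAI-style tool declarations. The tool names come from above; the stubs and schemas here are illustrative, not the exact code:

```python
# Sketch of the two-tool agent interface. The send/read plumbing is
# stubbed out; the real versions talk to the Kakoune session.

def evaluate_commands(commands, send=lambda cmd: None):
    """Run a list of Kakoune commands in the attached session."""
    for cmd in commands:
        send(cmd)
    return {"status": "ok", "executed": len(commands)}

def get_buffer_contents(read=lambda: ""):
    """Return the current buffer text, however the session exposes it."""
    return read()

# Schemas an OpenAI-compatible tool-calling endpoint would accept:
TOOLS = [
    {"type": "function", "function": {
        "name": "evaluate_commands",
        "description": "Run Kakoune commands in the attached session.",
        "parameters": {"type": "object", "properties": {
            "commands": {"type": "array", "items": {"type": "string"}}},
            "required": ["commands"]}}},
    {"type": "function", "function": {
        "name": "get_buffer_contents",
        "description": "Read back the current buffer.",
        "parameters": {"type": "object", "properties": {}}}},
]
```

The deliberately tiny surface area is the point: with only these two calls, the model has to reason about Kakoune's command language instead of falling back on shell one-liners.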

59nadir a day ago | parent [-]

As an addendum to this:

If I were to try to make something more useful out of this, I'd probably add the ability for LLMs to list buffers, give them an easier out for executing shell scripts the way they prefer, make it easier for them to list docs, and a few other things like that.

The tools and the interaction with Kakoune are really trivial to write; I already do this by having the agent write to the session FIFO (a very simple binary format), and I extract information via my own FIFO that Kakoune writes to (used only for the buffer data right now).
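In sketch form, the plumbing is just two FIFO endpoints. The paths and the newline-separated format here are made up for illustration; the real format is whatever your session speaks:

```python
# Illustrative FIFO plumbing: commands go out on the session FIFO,
# buffer contents come back on our own reply FIFO.

def send_commands(session_fifo, commands):
    # Write newline-separated commands into the session FIFO.
    with open(session_fifo, "w") as f:
        f.write("\n".join(commands) + "\n")

def read_buffer(reply_fifo):
    # Block until the editor writes the buffer contents back.
    with open(reply_fifo, "r") as f:
        return f.read()
```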

I think once you started using it more as a tool and less as a pseudo-benchmark, like I am, you'd probably think of even more things to add, but a lot of it comes down to making Kakoune's state visible and making shell spam (which the LLMs love) easier.

2ndorderthought 2 days ago | parent | prev | next [-]

Gemma 4 is definitely not meant for vibe/agentic coding. Not even worth trying. But it's a different weight class.

lambda 2 days ago | parent | prev | next [-]

Gemma 4 31b was working OK for me, but it was consuming tons of memory on SWA checkpoints, so I had to turn them way down; and as a 31b dense model it's fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.

The Qwen models are quite solid though.

xrd 2 days ago | parent [-]

What are you using to run it: vllm, llama.cpp, or something else?

Can you share your switches and approach for using tools?

lambda 2 days ago | parent [-]

llama.cpp

My setup is a bit of a mess as I experiment with different ways of configuring and hosting local models. At some point I was experimenting with the router server but stopped; some of my settings are still in models.ini while others are on the command line.

  podman run --env "HF_TOKEN=$HF_TOKEN" \
    --env "LLAMA_SERVER_SLOTS_DEBUG=1" \
    -p 8080:8080 --device /dev/kfd --device /dev/dri \
    --security-opt seccomp=unconfined --security-opt label=disable \
    --rm -it \
    -v ~/.cache/huggingface/:/root/.cache/huggingface/ \
    -v ./unsloth:/app/unsloth -v ./models.ini:/app/models.ini \
    llama.cpp-rocm7.2 \
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL \
    --chat-template-file /root/.cache/huggingface/gemma-4-31B-it-chat_template.jinja \
    -ctxcp 8 --port 8080 --host 0.0.0.0 -dio --models-preset models.ini

With the following as the relevant settings in models.ini (I actually have no idea whether these settings are applied when not using the router server; it's been hard for me to figure out which settings actually apply when using both the command line and models.ini):

  [*]
  jinja = true
  seed = 3407
  flash-attn = on

  [unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL]
  temperature = 1.0
  top_p = 0.95
  top_k = 64
And it looks like the chat_template.jinja I have is actually out of date by now, there was a new one pushed just a couple of days ago that seems to have some further tool calling fixes: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

As my harness, I'm using pi, with a pretty vanilla config.

Anyhow, Gemma 4 31b worked in this config, but it was slow and RAM hungry. Since then, I've mostly moved to Qwen 3.6 35b-a3b because it's a lot faster.

I'm not actually doing anything useful with these yet, but I've used them for some experiments and Qwen 3.6 35b-a3b was capable of doing some pretty long mostly unsupervised agentic loops in my experimentation.

blurbleblurble 2 days ago | parent | prev | next [-]

I agree but would add that gemma 4 is really nice at vibing though in ways qwen 3.6 could never.

Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem.

copper-float a day ago | parent [-]

As someone who has never used AI for any coding or agent tasks, I feel like I'm going insane when I read things like this.

>Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem

What in the world does this even mean?

BoredomIsFun 21 hours ago | parent | prev [-]

> Qwen 3.6 burns it to the ground.

Not for creative writing or NLP.

zkmon 2 days ago | parent | prev | next [-]

I have tested Gemma4-26B against Qwen3.6-35B. Gemma beats Qwen on structured data extraction and instruction following; it is far more precise in these tasks, while Qwen gets a bit more creative, verbose, and imprecise. However, Qwen has far more general smartness and higher token throughput. Qwen could precisely pinpoint issues in data quality and code, while Gemma had no clue. On coding skills, Qwen appears to have an edge over Gemma, but this could depend on the agent you use. For direct chat (llama_cpp UI), both models show the same skills for coding.

seemaze 2 days ago | parent [-]

That's interesting. I've been using Qwen3.5-35B for (poorly) structured table extraction based largely on the reports that Qwen had a much better vision implementation.

I have not benchmarked Qwen3.5 vs. Qwen3.6 for the same task, nor trialed Gemma4-26B. Guess it's time for some testing!

2ndorderthought 2 days ago | parent | prev [-]

I tried the Gemma 4 2b and 4b, I think. The 2b was not useful for me at all, a little too weak for my use cases.

The 4b was okay. It didn't get all of my small math questions right and didn't know about some of the libraries I use, but it was able to do some basic auto-complete-type stuff. For microscopic models I like the llama 3.2 3b more right now; it's a little faster and seems a little stronger for what I do. But everyone is different, and I don't think I'll use it anymore; this past month has been crazy for local model releases.

throwaw12 2 days ago | parent [-]

can you share your use cases for 2b and 4b models?

curious how people are leveraging these models

2ndorderthought 2 days ago | parent | next [-]

For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.

Instead of hitting Stack Overflow and Google, I will ask questions like "can you give me an example of how to do x in library y?", or "this error is appearing; what might be happening if I checked a, b, and c?", or "please write unit tests for this function". Or code auto complete.

I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know, or maybe, just maybe, gives me a fast idea to stub something while I focus on something more important; I am going to refactor anyways. Think a low-quality rubber duck.

I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.

Probably half the questions people ask ChatGPT could get roughly the same quality of answer from a small model, in my opinion. You can't fully trust an LLM anyways, so the difference between 60% and 70% accuracy isn't as large as marketing makes it sound. That said, the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore, the quality of Qwen 3.6 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.

SwellJoe 2 days ago | parent | prev [-]

Over the weekend I used the small models for experimental training runs when figuring out how to build LoRAs. It takes a lot less time to do smoke tests of the process on E2B vs the 31B version. And E4B was a reasonable stop along the line just to make sure the LoRA combined with the base model to produce coherent output.

Also, they're good enough for a lot of simple categorization and data extraction tasks, e.g. something like "flag abusive posts/comments", or "visit website, find the contact info, open hours, address". And they run fast on the kind of hardware you're likely to have at home, while the bigger dense versions decidedly do not.
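A task like the abusive-post flagging can be a very thin wrapper around a local llama.cpp server's OpenAI-compatible chat endpoint. The URL, prompt, and fail-safe parsing below are illustrative, not a recipe:

```python
# Sketch of a small-model moderation pass against a local llama.cpp
# server (OpenAI-compatible /v1/chat/completions endpoint assumed).
import json
import urllib.request

def build_flag_request(post, url="http://localhost:8080/v1/chat/completions"):
    body = {
        "messages": [
            {"role": "system",
             "content": "Reply with exactly FLAG if the post is abusive, otherwise OK."},
            {"role": "user", "content": post},
        ],
        "temperature": 0.0,  # keep the classification as deterministic as possible
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"})

def parse_flag(response_json):
    # Treat anything that isn't a clean "OK" as flagged, to fail safe.
    reply = response_json["choices"][0]["message"]["content"].strip().upper()
    return reply != "OK"
```

The small models are plenty for this shape of task because the hard part is pinned down by the prompt and the parsing, not left to the model's judgment.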

I used Gemma 4 itself to review and prune the data (my social media posts over the last ~5 years, about 5 million words) being ingested into the training process for a LoRA for Gemma 4. I found the bigger model (31B) was more nuanced and useful than the smaller ones, and I wasn't in a big hurry by that stage of the process, so I used the big one overnight. Gemma 4 31B was also a better judge of my writing than Gemini Flash 2.5, by my reckoning.

It was, again, more nuanced, and was able to recognize a generally helpful comment that opened kinda jokey/rude, while the smaller model and Gemini 2.5 Flash tended to gravitate toward extremes (1 or 5) rather than the 1-5 scale they were prompted to rate on. I assume Gemini 3.1 Flash is probably competitive or better, but I didn't try it, since I liked the results the self-hosted Gemma 4 was giving for free.

The little ones also run great on very modest hardware. Both run at comfortable interactive speed on mid-range tablets. E4B is blazing fast on an iPad M4 or Pixel 10 Pro and entirely usable on a midrange Android with sufficient RAM.