I ran Gemma 4 as a local model in Codex CLI (blog.danielvaughan.com)
68 points by dvaughan 14 hours ago | 26 comments
mhitza 2 hours ago
> The finding I did not expect: model quality matters more than token speed for agentic coding.

I'm really surprised that wasn't obvious. Also, instead of limiting the context size to something like 32k at the cost of roughly halving token generation speed, you can offload the MoE expert weights to the CPU with --cpu-moe.
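A sketch of what that looks like with llama.cpp's llama-server; the model filename and context size here are placeholders, not from the post:

```shell
# Hypothetical invocation: keep attention/dense layers on the GPU but
# leave the MoE expert weights in system RAM, trading some speed for
# the VRAM headroom needed for a large context window.
llama-server \
  -m ./gemma-4-26b-a4b-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --cpu-moe
```

Because the expert weights dominate a MoE model's size but only a few experts are active per token, moving them to system RAM costs less throughput than it would for a dense model of the same size.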
tuzemec 3 hours ago
I'm currently experimenting with running google/gemma-4-26b-a4b with LM Studio (https://lmstudio.ai/) and Opencode on an M3 Ultra with 48GB RAM, and it seems to be working. I had to increase the context size to 65536 so the prompts from Opencode would work, but no other problems so far. I tried running the same on an M3 Max with less memory, but couldn't increase the context size enough to be useful with Opencode.

It's also easy to integrate with Zed via ACP. For now it's mostly simple code review tasks and generating small front-end related code snippets.
meander_water an hour ago
I would have liked to see quality comparisons between the different quantization methods (Q4_K_M, Q6_K, Q8_0) rather than just tok/s.
dajonker an hour ago
I don't really have the hardware to try it out, but I'm curious how Qwen3.5 stacks up against Gemma 4 in a comparison like this. Especially this model, which was fine-tuned to be good at tool calling and has more than 500k downloads as of this moment: https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...
egorfine 2 hours ago
Related: I upgraded my M4 Pro 24GB to an M5 Pro 48GB yesterday. The same Gemma 4 MoE model (Q4) runs at about 8x the t/s on the M5 Pro and loads 2x faster from disk to memory. Gonna run some more tests later today.
zihotki 2 hours ago
For coding it makes no sense to use any quantization worse than Q6_K, in my experience. More heavily quantized models make more mistakes; that can still be fine for text processing, but not for coding.
danilop 2 hours ago
Nice walkthrough and interesting findings! The difference between the MoE and the dense models seems to be bigger than what benchmarks report. It makes sense, because a small gain in tool planning and handling can have a large influence on results.
karpetrosyan 2 hours ago
I think local models are not yet good or fast enough for complex things, so I'm just using local Gemma 4 for dummy refactorings or other really simple tasks.
vsrinivas 8 hours ago
Hey - I use the same, with both gemma4 and gpt-oss-*; some things I have to do for a good experience:

1) Pin to an earlier version of codex (sorry) - 0.55 is the best experience IME, but YMMV (see https://github.com/openai/codex/issues/11940, https://github.com/openai/codex/issues/8272).

2) Use the older completions endpoint (llama.cpp's responses support is incomplete - https://github.com/ggml-org/llama.cpp/issues/19138).
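For anyone wondering what step 2 looks like in practice: Codex CLI reads a model provider from its config file, and the wire protocol can be set per provider. This is a sketch under the assumption of a llama.cpp server on the default local port; the provider name and model name are placeholders:

```toml
# ~/.codex/config.toml (sketch, names are assumptions)
model = "gemma-4-26b-a4b"
model_provider = "llamacpp"

[model_providers.llamacpp]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"
# "chat" = older chat completions endpoint, instead of "responses"
wire_api = "chat"
```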
Havoc 2 hours ago
You can also try speculative decoding with the E2B model. Under some conditions it can result in a decent speedup.
blackmanta 9 hours ago
With an Nvidia Spark or a 128GB+ memory machine, you can get a good speedup on the 31B model if you use the 26B MoE as a draft model. It uses more memory, but I've seen acceptance rates around 70%+ using Q8 on both models.
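A sketch of that draft-model setup with llama.cpp's llama-server; the model filenames are placeholders for whatever GGUF files you actually have:

```shell
# Hypothetical speculative decoding setup: the fast draft model proposes
# tokens, the large target model verifies them in a single batched pass,
# so accepted drafts come almost for free.
llama-server \
  -m ./gemma-4-31b-Q8_0.gguf \
  -md ./gemma-4-26b-a4b-Q8_0.gguf \
  --draft-max 16 \
  -ngl 99 -ngld 99
```

The speedup depends on the acceptance rate: when the draft model agrees with the target most of the time (as in the ~70%+ figure above), the target model verifies many tokens per forward pass instead of generating one at a time.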
OutOfHere an hour ago
Gemma 4 is a strongly censored model, so much so that it refuses to answer medical and health-related questions, even basic ones. No one should be using it, and if this is the best that Google can do, it should stop now. Other models do not have such ridiculous self-imposed problems.
ehtbanton 9 hours ago
This is genuinely very helpful. I'm planning a MacBook Pro purchase with local inference in mind, and now I see I'll have to aim for a slightly higher memory option, because the Gemma A4 26B MoE is not all that!
anactofgod 9 hours ago
Amazing. Thanks for your detailed posts on the bake-off between the Mac and GB10, Daniel, and on your learnings. I had trying something similar on both compute platforms on my to-do list; your post should save me a lot of debugging, sweat, and tears.
fortyseven 9 hours ago
I've been VERY impressed with Gemma4 (26B at the moment). It's the first time I've been able to use OpenCode via a llamacpp server reliably and actually get shit done.

In fact, I started using it as a coding partner while learning how to use the Godot game engine (and some custom 'skills' I pulled together from the official docs). I purposely avoided Claude and friends entirely and just used Gemma4 locally this week... and it's really helped me figure out not just coding issues I was encountering, but also helped me sift through the documentation quite readily. I never felt like I needed to give in and use Claude. Very, very pleased.
brcmthrowaway 9 hours ago
Nothing about omlx?