soganess | 7 hours ago
Getting so close to good! I consider Gemma 4 31B (dense, no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I've run, including GPT OSS 120B and Nemotron Super 120B. On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32 GB (27.5 GB usable) with a 32K context window? Even last year, seeing this kind of performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
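If you want to sanity-check the fit on your own machine, here's a rough back-of-envelope sketch. The layer count, head dims, and bits-per-weight below are placeholders I made up for illustration, not Gemma 4's real specs:

    # Rough back-of-envelope: will the weights + KV cache fit in RAM?
    # The architecture numbers in main are placeholders, not Gemma 4's real specs.

    def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
        """Approximate size of the quantized weights in GiB."""
        return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                    head_dim: int, bytes_per_elem: float) -> float:
        """KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * element size."""
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

    if __name__ == "__main__":
        w = weights_gb(31, bits_per_weight=6.5)                  # ~Q6-class quant
        kv = kv_cache_gb(ctx_len=256_000, n_layers=60, n_kv_heads=8,
                         head_dim=128, bytes_per_elem=2)         # f16 cache, made-up dims
        print(f"weights ~{w:.0f} GiB + KV cache ~{kv:.0f} GiB = ~{w + kv:.0f} GiB")

The point is mostly that at 256K the cache can rival or exceed the weights themselves, which is why a q8 cache or a smaller window buys you so much headroom.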
thot_experiment | 4 hours ago
Gemma 4 IS good. I've literally had it get a thing right that Opus 4.7 missed; the edges are ragged, but I'm reliably finding use cases where it's basically equivalent. Ultimately the metric is "what can I RELY on it to do". Opus definitely knows a lot more and can sometimes do much more complex tasks, but when you're good about feeding it context, Gemma is amazing. The difference between the sets of things I trust the two models to do is surprisingly small.

I've had some insanely good runs recently working on my personal tooling as well as random projects. It's the first local model that can reliably be left to implement features in agentic mode on non-trivial projects.

https://thot-experiment.github.io/gradient-gemma4-31b/

This is a relatively complex piece of tooling built entirely by Gemma 4 inside OpenCode, where I manually intervened maybe four times over the course of a few hours. Running Q6_K_XL with 128k context @ q8: ~800 tok/s read, 16 tok/s write. Eagerly awaiting turboquant and MTP in llama.cpp; that should take me to 256k and 25-30 tok/s if the rumors are true.
gertlabs | an hour ago
The small Qwen 3.6 models handle context a little better than Gemma 4, but Gemma 4 26B in particular produces remarkably small and efficient solutions for its weight class. I was so impressed with its performance in our benchmark upon release that I wrote a blog post about it [0], although its position on the leaderboard later fell a bit as we ran it in more long-context agentic coding environments.
pdyc | an hour ago
I use the smaller Gemma E2B model for most of my editing and it works surprisingly well. The workflow is planning with SOTA models and execution via small models. If you plan properly and don't leave ambiguity for the smaller model, it works well.
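Roughly something like this, if both models sit behind OpenAI-compatible endpoints (llama.cpp's server exposes one); the URLs, model names, and prompts are just placeholders for illustration:

    # Two-stage workflow: a big model writes an unambiguous step list,
    # a small local model executes each step. Endpoints and model names are placeholders.
    from openai import OpenAI

    planner = OpenAI(base_url="https://api.example.com/v1", api_key="sk-...")  # hosted SOTA model
    executor = OpenAI(base_url="http://localhost:8080/v1", api_key="none")     # local small model

    def plan(task: str) -> list[str]:
        resp = planner.chat.completions.create(
            model="big-planner-model",
            messages=[{"role": "user",
                       "content": f"Break this editing task into small, unambiguous steps, one per line:\n{task}"}],
        )
        return [s for s in resp.choices[0].message.content.splitlines() if s.strip()]

    def execute(step: str, text: str) -> str:
        resp = executor.chat.completions.create(
            model="gemma-e2b",
            messages=[{"role": "user",
                       "content": f"Apply this edit and return only the full edited text:\n{step}\n---\n{text}"}],
        )
        return resp.choices[0].message.content

    text = open("draft.txt").read()
    for step in plan("Tighten the prose and fix grammar in draft.txt"):
        text = execute(step, text)
    print(text)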
discordance | 5 hours ago
Could you please share your time to first token and tok/s?
plufz | 2 hours ago
Does Gemma work better than Qwen3 in your experience?