| ▲ | mynti 10 hours ago |
| It is interesting that Gemini 3 beats every other model on these benchmarks, mostly by a wide margin, but not on SWE Bench. Sonnet is still king here, and all three look to be basically on the same level. Kind of wild to see them hit such a wall when it comes to agentic coding. |
|
| ▲ | Workaccount2 8 hours ago | parent | next [-] |
| I think Anthropic is reading the room and is just going to go hard on being "the" coding model. I suppose they feel that if they can win that, they can get an ROI without having to do full-blown multimodality at the highest level. It's probably pretty liberating, because you can make a "spiky" intelligence with only one spike to really focus on. |
| |
| ▲ | aerhardt 5 hours ago | parent | next [-] | | Codex has been good enough to me and it’s much cheaper. I code non-trivial stuff with it, like multi-threaded code, and at least for my style of AI coding, which is to do fairly small units of work with multiple revisions, it is good enough for me not even to consider the competition. Just giving you a perspective on how the benchmarks might not be important at all for some people and how Claude may have a difficult time being the definitive coding model. | | |
| ▲ | enraged_camel an hour ago | parent [-] | | >> Codex has been good enough to me and it’s much cheaper. It may be cheaper but it's much, much slower, which is a total flow killer in my experience. |
| |
| ▲ | htrp 8 hours ago | parent | prev | next [-] | | More playing to their strengths. A giant chunk of their usage data is basically code gen. | |
| ▲ | Miraste 5 hours ago | parent | prev [-] | | It remains to be seen whether that works out for them, but it seems like a good bet to me. Coding is the most monetizable use anyone has found for LLMs so far, and the most likely to persist past this initial hype bubble (if the Singularity doesn't work out :p). |
|
|
| ▲ | JacobAsmuth 42 minutes ago | parent | prev | next [-] |
| 50% of the CLs in SWE-Bench Verified are from the Django codebase, so if you're a big contributor to Django you should care a lot about that benchmark. Otherwise the difference between models is ±2 tasks done correctly. I wouldn't worry too much about it. Just try it out yourself and see if it's any better. |
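For scale: SWE-Bench Verified has 500 tasks, so a ±2-task gap works out to roughly 0.4 percentage points. A minimal sketch of that arithmetic, where the solved-task counts (382 and 380) are made up purely for illustration:

    # Rough scale of a +/-2 task gap on SWE-Bench Verified (500 tasks).
    # The solved-task counts below are hypothetical, for illustration only.
    TOTAL_TASKS = 500

    def score_pct(solved: int) -> float:
        """Percentage of benchmark tasks solved."""
        return 100.0 * solved / TOTAL_TASKS

    model_a_solved = 382  # hypothetical
    model_b_solved = 380  # hypothetical, two tasks fewer
    print(f"Model A: {score_pct(model_a_solved):.1f}%")  # 76.4%
    print(f"Model B: {score_pct(model_b_solved):.1f}%")  # 76.0%
    print(f"Gap: {score_pct(model_a_solved) - score_pct(model_b_solved):.1f} pp")  # 0.4 pp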
|
| ▲ | vharish 9 hours ago | parent | prev | next [-] |
| From my personal experience using CLI agentic coding tools, I think gemini-cli is fairly on par with the rest in terms of the planning/code that is generated. However, when I recently tried qwen-code, it gave me a better sense of reasoning and structure than Gemini. Claude definitely has its own advantages but is expensive (at least for some, if not for all). My point is, although the model itself may have performed well in benchmarks, I feel there are other tools that are doing better just by adapting better training/tooling. Gemini CLI, in particular, is not so great at looking up the latest info on the web. Qwen seemed to be trained better around looking up information (or reasoning about when/how to), in comparison. Even the step-wise breakdown of work felt different and a bit smoother. I do, however, use Gemini CLI for the most part just because it has a generous free quota with very few downsides compared to others. They must be getting loads of training data :D. |
| |
| ▲ | xnx 8 hours ago | parent [-] | | Gemini CLI is moving really fast. Noticeable improvements in features and functionality every week. |
|
|
| ▲ | Palmik 10 hours ago | parent | prev | next [-] |
| It also does not beat GPT-5.1 Codex on Terminal Bench (57.8% vs. 54.2%): https://www.tbench.ai/ I did not bother verifying the other claims. |
| |
| ▲ | HereBePandas 10 hours ago | parent [-] | | Not apples-to-apples. "Codex CLI (GPT-5.1-Codex)", which the site refers to, adds a specific agentic harness, whereas Gemini 3 Pro seems to have been run on a standard eval harness. It would be interesting to see the apples-to-apples figure, i.e. with Google's best harness alongside Codex CLI. | | |
| ▲ | Palmik 9 hours ago | parent | next [-] | | All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI. What do you mean by "standard eval harness"? | | |
| ▲ | lucassz 42 minutes ago | parent [-] | | I think the point is that it looks like Gemini 3 was only tested with the generic "Terminus 2", whereas Codex was tested with the Codex CLI. |
| |
| ▲ | enraged_camel 10 hours ago | parent | prev [-] | | Do you mean that Gemini 3 Pro is "vanilla" like GPT 5.1 (non-Codex)? | | |
| ▲ | HereBePandas 9 hours ago | parent [-] | | Yes, two things:
1. GPT-5.1 Codex is a fine-tune, not the "vanilla" 5.1
2. More importantly, GPT 5.1 Codex achieves its performance when used with a specific tool (Codex CLI) that is optimized for GPT 5.1 Codex. But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples. Will be interesting to see what Google releases that's coding-specific to follow Gemini 3. | | |
| ▲ | embedding-shape 6 hours ago | parent [-] | | > But when labs evaluate the models, they have to use a standard tool to make the comparisons apples-to-apples. That'd be a bad idea. Models are often trained for specific tools (GPT-5.1 Codex is trained for Codex CLI, and Sonnet has been trained with Claude Code in mind), and vice versa the tools are built with a specific model in mind, as they all work differently. Forcing all the models to use the same tool for execution sounds like a surefire way of getting results that don't represent real usage, but instead arbitrarily measure how well a model works with the "standard harness", which, if people start caring about it, will start to be gamed instead. |
|
|
|
|
|
| ▲ | felipeerias 10 hours ago | parent | prev | next [-] |
| IMHO coding use cases are much more constrained by tooling than by raw model capabilities at the moment. Perhaps we have finally reached the time of diminishing returns and that will remain the case going forward. |
| |
| ▲ | _factor 9 hours ago | parent [-] | | This seems preferable. Why waste tokens on tooling when a standardized, reliable interface to those tools should be all that's required? The magic of LLMs is that they can understand the latent space of a problem and infer a mostly accurate response. Saying you need to subscribe to get the latest tools is just a sales tactic trained into the models to protect profits. |
|
|
| ▲ | tosh 10 hours ago | parent | prev | next [-] |
| This might also hint at SWE-bench struggling to capture what “being good at coding” means. Evals are hard. |
| |
| ▲ | raducu 9 hours ago | parent [-] | | > This might also hint at SWE-bench struggling to capture what “being good at coding” means. My take would be that coding itself is hard, but I'm a software engineer myself, so I'm biased. |
|
|
| ▲ | aoeusnth1 6 hours ago | parent | prev | next [-] |
| Their scores on SWE-bench are very close because the benchmark is nearly saturated. Gemini 3 beats Sonnet 4.5 on Terminal Bench 2.0 by a nice margin (54% vs. 43%), which is also agentic coding (CLI instead of Python). |
|
| ▲ | alyxya 9 hours ago | parent | prev | next [-] |
| I think Google probably cares more about a strong generalist model rather than solely optimizing for coding. |
|
| ▲ | macrolime 9 hours ago | parent | prev | next [-] |
| Pretty sure it will beat Sonnet by a wide margin in actual real-world usage. |
|
| ▲ | HereBePandas 10 hours ago | parent | prev | next [-] |
| [comment removed] |
| |
| ▲ | Palmik 10 hours ago | parent [-] | | The reported results where GPT 5.1 beats Gemini 3 are on SWE Bench Verified, and GPT 5.1 Codex also beats Gemini 3 on Terminal Bench. | | |
| ▲ | HereBePandas 9 hours ago | parent [-] | | You're right on SWE Bench Verified, I missed that and I'll delete my comment. GPT-5.1 Codex beats Gemini 3 on Terminal Bench specifically with Codex CLI, but that's apples-to-oranges (hard to tell how much of that is the Codex-specific harness vs. the model). Looking forward to seeing the apples-to-apples numbers soon, but I wouldn't be surprised if Gemini 3 wins given how close it comes in these benchmarks. | | |
| ▲ | Palmik 9 hours ago | parent [-] | | All evals on Terminal Bench require some harness. :) Or "Agent", as Terminal Bench calls it. Presumably the Gemini 3 results are using Gemini CLI. |
|
|
|
|
| ▲ | varispeed 9 hours ago | parent | prev [-] |
| Never got good code out of Sonnet. It's been Gemini 2.5 for me, followed by GPT-5.x. Gemini is very good at pointing out flaws that are very subtle and not noticeable at first or second glance. It also produces code that is easy to reason about. You can then feed it to GPT-5.x for refinement and then back to Gemini for assessment. |
| |
| ▲ | baq 8 hours ago | parent [-] | | I find Gemini 2.5 Pro to be as good as, or in some cases better than, GPT-5.1 for SQL. It's aging otherwise, but they must have some good SQL datasets in there for training. |
|