HarHarVeryFunny 21 hours ago

Surely Gemini 3.0 Pro would be the appropriate comparison.

If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.

In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release coming just days after OpenAI announced "Code Red" (aka "we're behind the competition") as more marketing than anything else.

koakuma-chan 21 hours ago | parent | next [-]

That's what I said in my original message. In my experience, GPT-5.2 is better than Gemini 3 Pro and Opus 4.5.

Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.

Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium-difficulty tasks. But for anything serious, it's GPT-5.2.

HarHarVeryFunny an hour ago | parent | next [-]

I'm curious how you are testing/trying these latest models. Do you have specific test/benchmark tasks that they struggle with, and/or are you working on a real project and just trying alternatives when one model isn't performing well?

koakuma-chan an hour ago | parent [-]

I am using Cursor. It has all major models—OpenAI, Anthropic, Google, etc. Every time a new model comes out, I test it on a real project (the app that I am working on at work).

postalcoder 21 hours ago | parent | prev | next [-]

Agreed. Gemini 3 is still pretty bad at agentic coding.

Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced them with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.

misiti3780 17 hours ago | parent [-]

lol

Mkengin 17 hours ago | parent | prev [-]

Your experience seems to match the recent results from swe-rebench: https://swe-rebench.com/

BeetleB 21 hours ago | parent | prev [-]

Gemini 3.0 Flash outperforms Pro on many tasks - I believe the coding benchmark was one of them.

HarHarVeryFunny an hour ago | parent [-]

Presumably that would reflect Gemini 3.0 Flash having had more extensive RL training for coding than Pro? Maybe we can expect a "Gemini 3 Pro Coding" model in the future?

Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.