koakuma-chan a day ago

I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.

cj a day ago | parent | next [-]

Gemini 2.5 or 3? (3 was released yesterday)

koakuma-chan a day ago | parent | next [-]

I tried Gemini 3 Flash, and I am unimpressed. It's maybe a competitor to Cursor's Composer-1, but it's in a completely different league from GPT 5.2.

HarHarVeryFunny 21 hours ago | parent | next [-]

Surely Gemini 3.0 Pro would be the appropriate comparison.

If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.

In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI as more marketing than anything else, given that they announced "Code Red" (aka "we're behind the competition") just a few days ago.

koakuma-chan 21 hours ago | parent | next [-]

That's what I said in my original message. By my account, GPT 5.2 is better than Gemini 3 Pro and Opus 4.5

Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck during the demo, and they ended up not being able to show anything.

Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium-difficulty tasks. But for anything serious, it's GPT 5.2.

HarHarVeryFunny an hour ago | parent | next [-]

I'm curious how you are testing these latest models. Do you have specific test/benchmark tasks that they struggle with, and/or are you working on a real project and just trying alternatives when another model is not performing well?

koakuma-chan an hour ago | parent [-]

I am using Cursor. It has all major models—OpenAI, Anthropic, Google, etc. Every time a new model comes out, I test it on a real project (the app that I am working on at work).

postalcoder 21 hours ago | parent | prev | next [-]

Agreed. Gemini 3 is still pretty bad at agentic coding.

Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced them with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
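The depressing part is how trivial this is to catch mechanically. A minimal sketch of the kind of guard a harness could run before applying an edit (the pattern list and the `rejectAndRetry` hook are made up for illustration):

```typescript
// Hypothetical pre-apply check: reject model edits that replace real
// code with placeholder markers. The patterns are guesses based on the
// failure mode described above, not anything a real harness ships.
const PLACEHOLDER_PATTERNS: RegExp[] = [
  /<rest of (the )?code( goes)? here>/i,
  /\/\/\s*\.\.\.\s*existing code\s*\.\.\./i,
  /#\s*\.\.\.\s*rest of file\s*\.\.\./i,
];

export function looksTruncated(patch: string): boolean {
  return PLACEHOLDER_PATTERNS.some((re) => re.test(patch));
}

// Usage (rejectAndRetry is a stand-in for whatever the harness does):
//   if (looksTruncated(modelEdit)) rejectAndRetry(modelEdit);
```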

misiti3780 17 hours ago | parent [-]

lol

Mkengin 17 hours ago | parent | prev [-]

Your experience seems to match the recent results from swe-rebench: https://swe-rebench.com/

BeetleB 21 hours ago | parent | prev [-]

Gemini 3.0 Flash outperforms Pro on many tasks; I believe the coding benchmark was one of them.

HarHarVeryFunny an hour ago | parent [-]

Presumably that would reflect Gemini 3.0 Flash having more extensive RL training for coding than Pro? Maybe we can expect a "Gemini 3 Pro Coding" model in the future?

Opus 4.5 seems different: Anthropic's best coding model, but also their frontier general-purpose model.

walthamstow 21 hours ago | parent | prev [-]

Glad I'm not alone in thinking Flash 3 was like Composer 1 in speed but smarter

Tostino 20 hours ago | parent | prev [-]

3 has been out for at least a couple weeks for me.

koakuma-chan 19 hours ago | parent [-]

He meant 3 Flash, which came out recently

speedgoose 21 hours ago | parent | prev | next [-]

It's significantly slower though. At least for my use cases, I'd rather ask Claude Opus 4.5 and switch to GPT if Claude gets stuck.

nunodonato 21 hours ago | parent | prev [-]

I'm gonna call BS on these kinds of comments. "Better" at what? Coding models shouldn't even be compared in isolation. A big part of making them work in a real/big codebase is the tool that calls the model (Claude Code, gemini-cli, etc.). I'll bet Claude Code will still keep stealing your lunch every day of the week against any competitor out there.

koakuma-chan 21 hours ago | parent | next [-]

I haven't used CC in a few months; what killer features have they added? I am using Cursor. It's clunky, but not so clunky as to completely destroy model performance. I am pretty sure that for my tasks (an undocumented, buggy, legacy JavaScript project) GPT-5.2 is > all on any decent harness, because it doesn't give up or half-ass. It can run for 5 minutes or for 50 minutes, depending on your request.

dkdcio 21 hours ago | parent | next [-]

lol, bold claim for someone who hasn't used the primary competitor in months. I try to use all 3 (Claude Code, Codex CLI, Gemini CLI); there are tradeoffs among all 3.

koakuma-chan 21 hours ago | parent [-]

Read my reply to the sibling comment. To my knowledge, Claude Code is at most marginally better than Cursor, and it's mostly the model that matters. I'm not saying there is no room for improvement on the tooling side, but no one seems to have come up with anything so far. Let me know which killer features Claude Code has; I would be happy to learn.

dkdcio 21 hours ago | parent [-]

it’s the “agentic harness” — they have shipped tons of great features for the DevEx, but it’s the combination of better models (Sonnet 4.5 1M, now Opus 4.5) and the “prompting”/harness that improves how it actually performs

again I'm not saying Codex is worse; they're just different, and claiming that the only one you actively use is the best is a stretch

edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it was released. never really liked Cursor. now I primarily use Claude Code, given I found Codex slow and less "reliable" in a sense, but I keep trying all 3 and keep up with the changes (it is hard)

koakuma-chan 21 hours ago | parent [-]

> they have shipped tons of great features for the DevEx

Such as?

> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch

I am testing all models in Cursor.

> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor

I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else: Cursor has all the models, so to them nothing else seems worth paying for.

dkdcio 21 hours ago | parent [-]

you conveniently ignored the most important part of my comment :)

> Such as?

changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...

glhf

btw you started this thread with pure vibes, no evidence:

> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.

I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim

koakuma-chan 20 hours ago | parent [-]

You do not seem to be able to tell me anything substantial, i.e. specifically how Claude Code is a better harness than Cursor.

> “prompting”/harness that improves how it actually performs

is an abstract statement without any meaningful details.

nunodonato 21 hours ago | parent | prev | next [-]

it's not about features (although they've added plenty), it's the internal tooling and the way the model is prompted.

koakuma-chan 21 hours ago | parent [-]

The only thing I know that CC has that Cursor doesn't is the ability to spawn agents. You can just prompt CC with "spawn 10 agents" and it will make 10 subagents that run concurrently. But otherwise, I don't know what CC does that Cursor doesn't. On the contrary, AFAIK, CC doesn't index your codebase, and Cursor does.

NoveltyEngine 19 hours ago | parent [-]

Surely CC has a lower price? How much do you have to pay Cursor for the equivalent of what's provided in a 20x Claude Max plan?

mejutoco 17 hours ago | parent | next [-]

$200

https://cursor.com/pricing

koakuma-chan 18 hours ago | parent | prev [-]

I don't know. My company pays for it.

nunodonato 20 hours ago | parent | prev [-]

We don't have the ability to see the inner workings of Claude Code; it's not open source. You just use it and you see the difference. I've tried all of them, including Antigravity. Nothing beats Claude Code.

HumanOstrich 18 hours ago | parent | next [-]

You can trace what's going back and forth over the wire between Claude Code and the model in use. That's going to be more insightful than their huge blob of JavaScript using React to render a terminal GUI.
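A minimal sketch of that, assuming Node 18+ and that Claude Code still honors an ANTHROPIC_BASE_URL override the way its LLM-gateway docs describe (the port and filename here are made up):

```typescript
// trace-proxy.ts: minimal logging proxy for inspecting Claude Code's API traffic.
// Run with: npx tsx trace-proxy.ts
// Then start Claude Code pointed at it:
//   ANTHROPIC_BASE_URL=http://localhost:8787 claude
import http from "node:http";

const UPSTREAM = "https://api.anthropic.com";

http.createServer(async (req, res) => {
  // Buffer the request body so we can log it before forwarding.
  const chunks: Buffer[] = [];
  for await (const chunk of req) chunks.push(chunk as Buffer);
  const body = Buffer.concat(chunks);

  console.log(`--> ${req.method} ${req.url}`);
  if (body.length) console.log(body.toString("utf8"));

  // Forward upstream, dropping headers that fetch manages itself.
  const { host, connection, "content-length": _len, ...headers } =
    req.headers as Record<string, string>;
  const upstream = await fetch(UPSTREAM + (req.url ?? "/"), {
    method: req.method,
    headers,
    body: body.length ? body : undefined,
  });

  // Buffering the reply loses streaming granularity but keeps the log simple.
  const respBody = Buffer.from(await upstream.arrayBuffer());
  console.log(`<-- ${upstream.status} (${respBody.length} bytes)`);

  // fetch already decompressed the body, so drop the stale encoding headers.
  const respHeaders = Object.fromEntries(upstream.headers);
  delete respHeaders["content-encoding"];
  delete respHeaders["content-length"];
  delete respHeaders["transfer-encoding"];

  res.writeHead(upstream.status, respHeaders);
  res.end(respBody);
}).listen(8787, () => console.log("tracing on http://localhost:8787"));
```

Every system prompt, tool schema, and tool result then shows up in the log, which tells you far more about the harness than the minified client ever will.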

Mkengin 17 hours ago | parent | prev [-]

According to SWE-Rebench, Anthropic and OpenAI are really close in performance, while GPT-5.2 costs less than half what CC does per problem.

https://swe-rebench.com/