jumploops a day ago

Interesting, the new model's prompt is ~half the size (10KB vs. 23KB) of the previous prompt[0][1].

SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors (per their internal refactor benchmark: 33.9% -> 51.3%).

As someone who recently used Codex CLI (`gpt-5-high`) to do a relatively large refactor (multiple internal libs to dedicated packages), I kept running into bugs introduced when the model would delete a file and then rewrite it (missing crucial details). My approach would have been to just copy the file over and then make package-specific changes, so maybe better tool calling is at play here.
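
For illustration, that copy-first approach might look like the sketch below (paths are hypothetical); the point is that the original file is preserved verbatim, so only the package-specific edits are left to the model:

    # Copy the existing file into the new package unchanged,
    # instead of letting the model rewrite it from memory
    mkdir -p packages/http-client/src
    cp src/internal/http-client/client.ts packages/http-client/src/client.ts
    # Only package-specific edits remain: exports, import paths, package.json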

Additionally, they claim the new model is more steerable (both with AGENTS.md and generally). In my experience, Codex CLI w/gpt-5 is already a lot more steerable than Claude Code, but any improvements are welcome!

[0]https://github.com/openai/codex/blob/main/codex-rs/core/gpt_...

[1]https://github.com/openai/codex/blob/main/codex-rs/core/prom...

tedsanders a day ago | parent | next [-]

> SWE-bench performance is similar to normal gpt-5, so it seems the main delta with `gpt-5-codex` is on code refactors

SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench scores but very different user experiences.

Here's a nice thread on X about the things that SWE-bench doesn't measure:

https://x.com/brhydon/status/1953648884309536958

dwaltrip a day ago | parent [-]

So annoying that you can't read replies without an account nowadays.

Tiberium a day ago | parent | next [-]

Use Nitter. The main instance works, but there are a lot of other instances as well.

https://nitter.net/brhydon/status/1953648884309536958

dcre 20 hours ago | parent | prev [-]

Change the url from x.com to xcancel.com to see it all.

pants2 a day ago | parent | prev | next [-]

Interestingly, "more steerable" can sometimes be a bad thing, as it will tend to follow your prompt to the letter even if that's against your interests. It requires better prompting and generally knowing what you're doing - might be worse for vibe-coders and better for experienced SWEs.

jumploops a day ago | parent | next [-]

Yes, given a similarly sparse prompt, Claude Code seems to perform "better" because it eagerly does things you don't necessarily know to ask for.

GPT-5 may underwhelm with the same sparse prompt, as it seems to do exactly what's asked and nothing more.

You can still "fully vibe" with GPT-5, but the pattern works better in two steps:

1. Plan (iterate on high-level spec/PRD, split into actions)

2. Build (work through plans)

Splitting the context here is important, as any LLM will perform worse as the context gets more polluted. (A rough sketch of the two-step flow is below.)
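
Assuming Codex CLI's non-interactive `codex exec` mode (file and task names are illustrative):

    # 1. Plan: one session produces a reviewable spec, then exits
    codex exec "Survey the codebase and write PLAN.md: a high-level spec split into small, independent tasks"

    # ...review and edit PLAN.md by hand...

    # 2. Build: a fresh session per task keeps the context clean
    codex exec "Implement task 1 from PLAN.md only; don't touch anything outside its scope"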

Turskarama 20 hours ago | parent [-]

The best of both worlds would surely be for the LLM to write what you've asked, but also leave comments about other things it could have done, so you can consider those extras when you check the output.

htrp a day ago | parent | prev [-]

I think they're indexing here on professional work (people in the VS Code terminal).

siva7 a day ago | parent | prev | next [-]

So you're all saying that suddenly Codex CLI w/ GPT-5-Codex is better than Claude Code? Hard to believe.

jumploops a day ago | parent | next [-]

Not suddenly; it's been better since GPT-5 launched.

Prompting is different, but in a good way.

With Claude Code, you can use less prompting, and Claude will get token-happy and expand on your request. Great for greenfield/vibing, bad for iterating on existing projects.

With Codex CLI, GPT-5 seems to handle instructions much more precisely. It won't just go off on its own and do a bunch of work; it will do what you ask.

I've found that being more specific up-front gets better results with GPT-5, whereas with Claude, being more specific doesn't necessarily stop the eagerness of its output.

As with all LLMs, you can't compare apples to oranges, so to clarify: my experience is primarily with TypeScript and Rust codebases.

srcreigh a day ago | parent | next [-]

Codex CLI will of course sometimes do the wrong thing, or do something extra that you didn't intend.

It seems about half my sessions quickly become "why did you do that? rip __ out and just do ___". Then again, most of the other sessions involve Codex correctly inferring what I wanted without having to be so specific.

elcritch 15 hours ago | parent [-]

Yeah, I tried Claude Code CLI and never found it too useful, but that was the Claude 3.5 era. Using Claude 3.7/4.0 via Cursor was much better, but it still had to be micromanaged.

GPT-5 + Codex CLI has been pretty productive for me. It's able to get a lot right from a simple prompt without getting too distracted by other crap. It's not perfect, but it's pretty good.

I actually worry GPT-5-Codex will make it worse in that respect, though. One of the best parts of GPT-5 + Codex CLI is that it tends to plan and research first, then write code.

mvieira38 6 hours ago | parent | prev | next [-]

This has been my experience even in Cursor. I often select the GPT-5 option because I know it will "know" better how much reasoning effort it needs.

drob518 20 hours ago | parent | prev | next [-]

Yeah, I have struggled to keep Claude focused on what I want and only what I want. I have no experience with GPT-5-Codex, but maybe I should.

j45 21 hours ago | parent | prev [-]

Sounds like the split between finding what needs to be done, making a plan, and executing on it is still something to consider and be aware of.

Claude Code has been a revelation, and a bit of a letdown over the past 45 days.

Some open acknowledgement would have been great, but in lieu of that, it seems best to hop to a new tool and learn to prompt it well yourself, rather than relying on the model to read between the lines, at least until usage gets "optimized" and it no longer seems to work for those folks either.

I've seen some interesting files that help a model treat a programming language as its strong suit, even one it isn't an expert in, and that spell out how best to develop with it.

stpedgwdgfhgdd 9 hours ago | parent [-]

Anthropic acknowledged there were bugs, which are now resolved; see their status page for the latest info:

https://status.anthropic.com/

j45 3 hours ago | parent [-]

The models themselves respond differently when prior chat requests are run again.

strangescript 15 hours ago | parent | prev | next [-]

It's been better for a while; people are sleeping on it, just like they slept on Claude Code when it initially came out.

barrenko 13 hours ago | parent | prev | next [-]

People are using Claude Code + GLM models as an alternative too, with some complaints flying around.

wahnfrieden a day ago | parent | prev [-]

It is 100% true. And they are rapidly losing users to Codex. Charts were shared recently showing a massive migration underway.

CuriouslyC 19 hours ago | parent | next [-]

Oh yeah, Sonnet performance has been in the toilet for me. They claim they've mitigated it, but when 4.0 first dropped, CC was really impressive; now I constantly have to babysit it, because any time it hits a challenge it'll just stop trying, make a simple toy version, and declare false victory. If I don't catch it and let it build on top of that bullshit, things get nasty in a hurry.

It's a shame, because the plan is a great deal, but the number of all-caps, profanity-laced messages I'm firing off at Claude is too damned high.

resonious 12 hours ago | parent | next [-]

This hits home for me too. Claude feels like it has gotten more "yes-man"-y. I can no longer trust its judgement. Even if I come in with something dead wrong, I'm "absolutely right" and it finds amazing ways to spin my BS into something vaguely believable.

I am also bullying Claude more nowadays. Seeing this thread, I might give Codex another go. (I was on Codex CLI before Claude Code; at that time, Claude blew Codex out of the water, but something's changed.)

dmazin 13 hours ago | parent | prev | next [-]

Yes, this. I feel like I’m going crazy. I pay for the extra Opus usage and I keep checking the model switcher to see if it has automatically switched to Sonnet. It has not. I just have a lot more experiences of it feeling anecdotally dumb lately.

wahnfrieden 16 hours ago | parent | prev [-]

GPT-5 is comparable to Opus without needing to constantly dip back down to Sonnet for cost management.

AstroBen 16 hours ago | parent | prev [-]

Am I missing something? That... doesn't seem to be true? https://npmtrends.com/@anthropic-ai/claude-code-vs-@openai/c...

j45 21 hours ago | parent | prev | next [-]

I wonder if this means part of the prompt has been moved to a higher level somehow... or baked into the model itself.

groby_b 21 hours ago | parent | prev [-]

Small suggestion on refactors into packages: Move the files manually. Just tell codex "they used to be in different locations, fix it up so it builds".

It seems the concept of file moving isn't something Codex (and other CLIs) handles well yet. (Same goes for removing: I've ~never seen success in tracking moves and removes in the git commit if I ask for one.)
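
A minimal sketch of that manual-move flow (package layout is hypothetical; assumes the CLI accepts an initial prompt argument):

    # Move the files yourself so git records clean renames
    mkdir -p packages/utils/src
    git mv src/internal/utils/*.ts packages/utils/src/
    git commit -m "move internal utils into packages/utils"

    # Then hand only the mechanical fix-up to the model
    codex "These files used to live in src/internal/utils/; fix the imports so it builds"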

artemisart 20 hours ago | parent [-]

Does refactoring mean moving things around for people? Why not use your IDE for this? It already handles fixing imports (or use find-replace), and it's faster and deterministic.
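
For the purely mechanical part, a deterministic find-and-replace is indeed easy to script (import paths and package name are hypothetical; the sed -i syntax shown is GNU sed):

    # Rewrite the old import path everywhere, deterministically
    grep -rl 'internal/http-client' src/ \
      | xargs sed -i 's#internal/http-client#@acme/http-client#g'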

jumploops 20 hours ago | parent [-]

Not necessarily -- in the case I posted about, we first abstracted some common functionality to internal libs, and then further abstracted that functionality into a number of packages (so they could be used by other clients).

So it was part simplification (dedupe+consolidate), and part moving files around.