inerte 5 hours ago

Codex has always been better at following AGENTS.md and prompts, but I'd say in the last 3 months Claude Code has gotten worse (freestyling like we see here) while Codex has gotten even stricter.

80% of the time I ask Claude Code a question, it assumes I'm asking because I disagree with something it said, then acts on that supposition. I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS." Which is ridiculous.

Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.

With this project I am doing, because I want to be more strict (it's a new programming language), Codex has been the perfect tool. I am mostly using Claude Code when I don't care so much about the end result, or it's a very, very small or very, very new project.

0xbadcafebee 15 minutes ago

This is mostly dependent on the agent because the agent sets the system prompt. All coding agents include in the system prompt the instruction to write code, so the model will, unless you tell it not to. But to what extent they do this depends on that specific agent's system prompt, your initial prompt, the conversation context, agent files, etc.

If you were just chatting with the same model (not in an agent), it doesn't write code by default, because it's not in the system prompt.
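
This difference is easy to illustrate with a minimal sketch. Nothing below is a real Codex or Claude Code prompt; the strings and the `build_request` helper are hypothetical, showing only that the same user message arrives wrapped in different standing instructions:

```python
# Hypothetical system prompts; real agents' prompts are far longer.
AGENT_SYSTEM = (
    "You are a coding agent. Use your tools to edit files and run "
    "commands to complete the user's request."
)
CHAT_SYSTEM = "You are a helpful assistant."

def build_request(system_prompt: str, user_message: str) -> list[dict]:
    """Assemble the message list the model actually sees."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

question = "Why does this test fail?"
agent_request = build_request(AGENT_SYSTEM, question)
chat_request = build_request(CHAT_SYSTEM, question)

# Same user turn, different standing instructions: the agent variant
# is primed to act, the chat variant to answer.
assert agent_request[1] == chat_request[1]
assert agent_request[0] != chat_request[0]
```

The model never sees "chat mode" or "agent mode" as such; it only sees whatever system prompt the harness prepended.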

kace91 5 hours ago

>I've resorted to append things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.

Funny to read that, because for me it's not even new behavior. I have developed a tendency to add something like "(genuinely asking, do not take as a criticism)".

I'm from a more confrontational culture, so I assumed this was just corporate-American tone framing criticism softly, and me compensating for it.

lubujackson 4 hours ago

I feel like people are sleeping on Cursor; no idea why more devs don't talk about it. It has a great "Ask" mode, the debugging mode has recently gotten more powerful, and its plan mode has started to look more like Claude Code's plans when I test them head to head.

AlotOfReading 4 hours ago

I've had some luck taming prompt introspection by spawning a critic agent that looks at the plan produced by the first agent and vetoes it if the plan doesn't match the user's intentions. LLMs are much better at identifying rule violations in a piece of external text than at regulating their own output. Same reason they generate unnecessary comments no matter how many times you tell them not to.
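
The pattern above can be sketched as a short control loop. Both "agents" are stubbed with trivial rules here so the flow is visible; in a real setup, `planner()` and `critic()` would each be separate LLM calls, and the names are mine, not from any particular framework:

```python
# Two-agent sketch: a planner proposes, a critic checks the plan
# against the user's message and vetoes mismatches.

def planner(user_message: str) -> str:
    # Stub planner that (like the behavior being complained about)
    # assumes every message wants code edits.
    return "Edit the code to address: " + user_message

def critic(user_message: str, plan: str) -> bool:
    # Stub judgment standing in for a second LLM call: a plain
    # question should not produce an edit plan. True means approved.
    is_question = user_message.rstrip().endswith("?")
    wants_edits = plan.lower().startswith("edit")
    return not (is_question and wants_edits)

def run(user_message: str) -> str:
    plan = planner(user_message)
    if critic(user_message, plan):
        return plan
    return "Vetoed: the user asked a question; answer it, do not edit."
```

The useful property is the one noted above: the critic only has to spot a violation in someone else's text, which models do more reliably than policing their own generation.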

thomaslord 2 hours ago

This is extra rough because Codex defaults to letting the model be MUCH more autonomous than Claude Code. The first time I tried it out, it ended up running a test suite without permission which wiped out some data I was using for local testing during development. I still haven't been able to find a straight answer on how to get Codex to prompt for everything like Claude Code does - asking Codex gets me answers that don't actually work.

clarus 4 hours ago

The solution for this might be to add a ME.md alongside AGENT.md, so the agent can learn and write down our character: to know, for example, whether a question is implicitly a command.

chrysoprace 2 hours ago

Maybe I should give Codex a go, because sometimes I just want to ask a question (Claude) and not have it scan my entire working directory and chew up 55k tokens.

hrimfaxi 5 hours ago

> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.

Can you speak more to that setup?

stavros 4 hours ago

I've added an instruction: "do not implement anything unless the user approves the plan using the exact word 'approved'".

This has fixed all of it; Claude now waits until I explicitly approve.
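
The rule is easy to sketch as a gate. This is only an illustration of the instruction, assuming a case-insensitive whole-word match; it is not how Claude Code actually parses replies:

```python
import re

def plan_approved(reply: str) -> bool:
    # Whole-word, case-insensitive match on "approved";
    # substrings like "unapproved" do not count.
    return re.search(r"\bapproved\b", reply, re.IGNORECASE) is not None

def maybe_implement(reply: str, plan: str) -> str:
    # Only proceed once the magic word appears; otherwise keep waiting.
    if plan_approved(reply):
        return "Implementing: " + plan
    return "Waiting for explicit approval."
```

Demanding one exact token removes the ambiguity that lets the model read "looks good, but..." as a green light.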

parhamn 5 hours ago

I added an "Ask" button to my agent UI (openade.ai) specifically because of this!

darkoob12 5 hours ago

This is not Claude Code. And my experience is the opposite: for me Codex is not working at all, to the point that it's no better than asking the chatbot in the browser.

casey2 4 hours ago

For the last 12 months labs have been: 1. check-pointing, 2. training until model collapse, 3. reverting to the checkpoint from 3 months ago, 4. waiting until people have gotten used to the shitty new model. Anthropic said they "don't do any programming by hand" the last 2 years. Anthropic's API has 2 nines.

cmrdporcupine 4 hours ago

I'm back on Claude Code this month after a month on Codex and it's a serious downgrade.

Opus 4.6 is a jackass. It's got Dunning-Kruger and hallucinates all over the place. I had forgotten the experience (as in the Gist above) of jamming on the escape key: "no no no, I never said to do that." I also don't remember 4.5 being this bad.

But GPT 5.3 and 5.4 are a far more precise and diligent coding experience.