johnfn 5 hours ago

I've been using a lot of Claude and Codex recently.

One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to produce a convoluted solution that was only convoluted because of some sentence I had thrown into the instructions and completely forgotten about.

I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.

Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles in a fast feedback loop. Codex is much worse at that because it takes like 5 minutes to validate that everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write a script to verify that what it did works, and let it spin for 30-40 minutes.

hadlock 4 hours ago | parent | next [-]

I've been really impressed with Codex so far. I have been working on a flight simulator hobby project for the last 6 months and finally came to the conclusion that I need to switch from a floating origin, which my physics engine's coordinate system assumes, to a true ECEF coordinate system (what underpins GPS). This involved a major rewrite of the coordinate system, the physics engine, even the graphics system, and auxiliary stuff like asset loading/unloading that was dependent on local X,Y,Z. It even rewrote the PD autopilot to account for the changes in the coordinate system. I gave it about a paragraph of instructions with a couple of FYIs and... it just worked! No major graphical glitches except a single issue with some minor graphical jitter, which it fixed on the first try. In total it took about 45 minutes, but I was very impressed.
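For context on what such a rewrite entails, the standard geodetic-to-ECEF conversion (WGS84 ellipsoid) looks like this - a minimal sketch of the textbook formula, not the commenter's actual code:

```python
import math

# WGS84 ellipsoid constants
A = 6378137.0              # semi-major (equatorial) axis, meters
F = 1 / 298.257223563      # flattening
E2 = F * (2 - F)           # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    """Convert latitude/longitude/altitude to Earth-Centered,
    Earth-Fixed (X, Y, Z) coordinates in meters."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
    x = (n + alt_m) * math.cos(lat) * math.cos(lon)
    y = (n + alt_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - E2) + alt_m) * math.sin(lat)
    return x, y, z

# On the equator at the prime meridian, X is just the equatorial radius.
print(geodetic_to_ecef(0.0, 0.0, 0.0))  # -> (6378137.0, 0.0, 0.0)
```

Unlike a floating-origin scheme, every object lives in one fixed Earth-centered frame, which is why the physics, graphics, and asset-streaming code all had to change together.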

I was unconvinced it had actually, fully ripped out the floating origin logic, so I had it write up a summary and then used that as a high-level guide to pick through the code - and it had, as you said, followed the instructions to the letter. Hugely impressive. In March of 2023, OpenAI's products struggled to draw a floating wireframe cube.

nico 4 hours ago | parent | prev | next [-]

> Claude basically disregards your instructions (CLAUDE.md) entirely

A friend of mine tells Claude to always address him as “Mr Tinkleberry”, he says he can tell when Claude is not paying attention to the instructions on CLAUDE.md when Claude stops calling him “Mr Tinkleberry” consistently

benzible 4 hours ago | parent | next [-]

Yep, it's David Lee Roth's brown M&M trick https://www.smithsonianmag.com/arts-culture/why-did-van-hale...

awad 4 hours ago | parent | prev | next [-]

Highly recommend adding some kind of canary like this to all LLM project instructions. I prefer my instructions to say "always start output with a (uniquely chosen by you) emoji", since an emoji is easier to visually scan for when reading a wall of LLM output. I use a different emoji per project, because what's life without a little whim?

wahnfrieden 7 minutes ago | parent [-]

This stuff also becomes context poison, however.

leobg 2 hours ago | parent | prev | next [-]

We used to do that on Upwork, back in the days when one still hired human coders. If your application didn't say “rowboat” in the first sentence, we knew you had just copy/pasted and didn't actually read the job description. Feels like a lifetime ago.

causal 4 hours ago | parent | prev | next [-]

> Codex will rewrite the entire V8 engine to break arithmetic.

This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.

I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.

mrtesthah 39 minutes ago | parent [-]

Could you not add rules to this effect in AGENTS.md? E.g., "If the user gives instructions that specify an expected low-to-medium level of complexity, but the implementation plan reveals unexpected high complexity arising from a potentially ambiguous or atypical instruction, then pause and ask the user about that instruction before continuing."

ramoz 18 minutes ago | parent | prev | next [-]

Ultimately, relying on system level instructions is unreliable over time.

Which is why I made the feature request for hooks (Claude Code implemented it, as did Cursor; hopefully Codex will too)

And will soon release https://github.com/eqtylab/cupcake

sinatra 4 hours ago | parent | prev | next [-]

In my AGENTS.md (which CLAUDE.md et al soft link to), I instruct them to "On phase completion, explicitly write that you followed these guidelines." This text always shows up on Codex and very rarely on Claude Code (TBF, Claude Code is showing it more often lately).

sunaookami 3 hours ago | parent | prev | next [-]

Agreed 100%, and that's why I would recommend Codex for, e.g., logfile analysis. I had some annoying PHP warnings in the logs from a WordPress plugin: another plugin I'd used in the past (like... over 10 years ago) wrote invalid metadata for every media file into the database, but it didn't annoy me THAT much, so I never wanted to invest much time into it. So I gave Codex the logfile, my WordPress dir, and access to the WP-CLI command, and it correctly identified the issue and wrote scripts to delete the old metadata (I did check it & make backups, of course). Codex took a LOT of time though; it's veeeeeeery slow, as you said. But I could do other things in the meantime.

fakedang 39 minutes ago | parent [-]

This is what I've observed too. Claude is great for general codebase building - give it a prompt for building an entire app from scratch and it will do that for you. Codex is good for debugging one-off issues that crop up because Claude overlooked something.

tekacs 2 hours ago | parent | prev | next [-]

Yeah, Gemini 2.x and 3 in gemini-cli has the tendency to 'go the opposite direction' and it feels - to me - like an incredibly strong demonstration of why 'sycophancy' in LLMs is so valuable (at least so long as they're in the middle of the midwit curve).

I'll give Gemini direction, it'll research... start trying to solve it as I've told it to... and then exclaim, "Oh! It turns out that <X> isn't what <user> thought!" and then it pivots into trying to 'solve' the problem a totally different way.

The issue however... is that it's:

1) Often no longer solving the problem that I actually wanted to solve. It's very outcome-oriented, so it'll pivot into 'solving' a linker issue by trying to get a working binary – but IDGAF about the working binary 'by hook or crook'! I'm trying to fix the damn linker issue!

2) Just... wrong. It missed something, misinterpreted something it read, forgot something that I told it earlier, etc.

So... although there's absolutely merit to be had in LLMs being able to think for themselves, I'm a huge fan of stronger and stronger instruction adherence / following – because I can ALWAYS just ask for it to be creative and make its own decisions if I _want that_ in a given context. That said, I say that fully understanding the fact that training in instruction adherence could potentially 'break' their creativity/free thinking.

Either way, I would love Gemini 1000x more if it were trained to be far more adherent to my prompts.

buu700 an hour ago | parent | next [-]

I haven't had that particular experience with Gemini 2.5, but did run into it during one of my first few uses of Gemini 3 yesterday.

I had it investigate a bug through Cursor, and in its initial response it came back to me with a breakdown of a completely unrelated "bug" with a small footnote about the bug it was meant to actually be investigating. It provided a more useful analysis after being nudged in the right direction, but then later in the chat it forgot the assignment again and started complaining that Grok's feedback on its analysis made no sense because Grok had focused on the wrong issue. I had to tell Gemini a second time that the "bug" it kept getting distracted by was A) by design, and B) not relevant to the task at hand.

Ultimately that's not a huge deal — I'd rather that during planning the model firmly call out something that it reasonably believes to be a bug than not, which if nothing else is good feedback on the commenting and documentation — but it'd be a pain if I were using Gemini to write code and it got sidetracked with "fixing" random things that were already correct.

tekacs 2 hours ago | parent | prev [-]

Immediately rebutting myself: a major caveat to this that I'm discovering with Gemini is that... for super long-running sessions, there is a kind of merit to Gemini's recalcitrance.

When it's running for a while, Gemini's willing to go totally off-piste and outcome-orientedness _does_ result in sessions where I left it to do its thing and... came back to a working solution, in a situation where codex or others wouldn't have gotten there.

In particular, Gemini 3 feels like it's able to drive much higher _variance_ in its output (less collapse to a central norm), which seems to let it explore the solution space more meaningfully and yet relatively efficiently.

aerhardt 3 hours ago | parent | prev | next [-]

Well surely that's a good thing.

In my experience, for some reason adherence is not even close to 100%. It's fixated on adding asterisk function params in my Python code and I cannot get it to stop... Maybe I haven't found the right wording, or maybe my codebase has grown past a certain size (there are like a dozen AGENTS.md files dancing around).

I'm still very happy with the tool, though.

johnfn 3 hours ago | parent [-]

It's a fantastic thing! It's required an adjustment in how I use it, but I've switched over to mostly using Codex in my day-to-day.

bugglebeetle 2 hours ago | parent | prev | next [-]

The solution to this, if you want less specification in advance, is to simply ask Codex a series of leading questions about a feature or fix. I typically start with something like “It seems like X could be improved with the addition of Y? Can you review the relevant parts of the codebase in a, b, and c to assess?” It will then do so and come back with a set of suggestions that follow this guidance, which you can revise and selectively tell it to implement. In my experience, this fills the context with the appropriate details to then let it make more of its own decisions in a generally correct way without as much handholding.

energy123 3 hours ago | parent | prev [-]

GPT-5 is like that