enraged_camel 20 hours ago

Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt.

Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”

I run these comparisons regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.

Languages: JavaScript, Elixir, Python

paustint 12 hours ago

The one time I was impressed with Codex was when I was adding translations in a bunch of languages for a business document generation service. I used Claude to do the initial work and cross-checked with Codex.

The Codex agent ran for a long time and created and executed a bunch of Python scripts (according to its thinking output) to compare the translations, and it found a number of possible issues. I'm not sure where the scripts were stored or executed; our project doesn't use Python.
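
For flavor, here's roughly the kind of check such a script might do, assuming the translations live in flat per-locale JSON files like en.json and de.json (all names and checks here are hypothetical, since I never saw the actual scripts):

    #!/usr/bin/env python3
    # Sketch of a translation-comparison script of the sort Codex might
    # have generated. File names, key structure, and checks are assumptions.
    import json
    import re
    import sys

    PLACEHOLDER = re.compile(r"\{[^}]+\}")  # interpolation tokens like {name}

    def load(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)

    def check(base_path, target_path):
        base, target = load(base_path), load(target_path)
        issues = []
        for key, src in base.items():
            if key not in target:
                issues.append(f"MISSING      {key}")
                continue
            dst = target[key]
            if not dst.strip():
                issues.append(f"EMPTY        {key}")
            elif dst == src:
                issues.append(f"UNTRANSLATED {key}: {src!r}")
            elif set(PLACEHOLDER.findall(src)) != set(PLACEHOLDER.findall(dst)):
                issues.append(f"PLACEHOLDER  {key}: {src!r} -> {dst!r}")
        # keys present in the translation but not in the base locale
        for key in target.keys() - base.keys():
            issues.append(f"EXTRA        {key}")
        return issues

    if __name__ == "__main__":
        # Usage: python check_translations.py en.json de.json
        for line in check(sys.argv[1], sys.argv[2]):
            print(line)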

Then I fed the issues Codex found to Claude for a second "opinion". Claude said the feedback was obviously from someone who knew the native language very well and agreed with all of it.

I was really surprised at how long Codex spent thinking and analyzing, probably 10 minutes. (This was about a month or more ago; I don't recall exactly which model.)

Claude is pretty decent IMO; amp code is better, but it seems to burn through money pretty quickly.

tmikaeld 20 hours ago

I have the same experience. To make it worse, there's a world of difference between the all-too-many model versions and effort levels.