enraged_camel 20 hours ago:
Last night I gave one of the flaky tests in our test suite to three different models, using the exact same prompt. Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.” I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably. Languages: JavaScript, Elixir, Python.
paustint 12 hours ago:
The one time I was impressed with Codex was when I was adding translations in a bunch of languages for a business document generation service. I used Claude to do the initial work and cross-checked with Codex. The Codex agent ran for a long time and, according to the thinking output, created and executed a bunch of Python scripts to compare the translations, and it found a number of possible issues. I'm not sure where the scripts were stored or executed; our project doesn't use Python. Then I fed the issues Codex found to Claude for a second "opinion." Claude said the feedback was obviously from someone who knew the native language very well, and agreed with all of it. I was really surprised at how long Codex spent thinking and analyzing - probably 10 minutes. (This was ~1+ month ago; I don't recall exactly which model.) Claude is pretty decent IMO - amp code is better, but it seems to burn through money pretty quickly.
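Not what Codex actually ran (I never saw the scripts), but for a sense of what that kind of cross-check might look like, here's a minimal sketch in Python. It assumes flat key-to-string JSON locale files under a locales/ directory with en.json as the source; the layout, file names, and checks (missing keys, values left identical to English, placeholder mismatches) are all assumptions for illustration.

    # Hypothetical sketch, not the scripts Codex generated: compare each locale's
    # flat key -> string JSON file against the English source, flagging missing
    # keys, values left identical to English, and placeholder mismatches.
    import json
    import re
    from pathlib import Path

    PLACEHOLDER = re.compile(r"\{[^{}]+\}")  # matches tokens like {name} or {count}

    def load_locale(path: Path) -> dict:
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    def check_locale(source: dict, target: dict, name: str) -> list:
        issues = []
        for key, src in source.items():
            if key not in target:
                issues.append(f"{name}: missing key '{key}'")
                continue
            tgt = target[key]
            if tgt.strip() == src.strip():
                issues.append(f"{name}: '{key}' looks untranslated")
            if set(PLACEHOLDER.findall(src)) != set(PLACEHOLDER.findall(tgt)):
                issues.append(f"{name}: placeholder mismatch in '{key}'")
        return issues

    if __name__ == "__main__":
        locales = Path("locales")  # assumed layout: locales/en.json, locales/de.json, ...
        source = load_locale(locales / "en.json")
        for path in sorted(locales.glob("*.json")):
            if path.stem == "en":
                continue
            for issue in check_locale(source, load_locale(path), path.stem):
                print(issue)

An actual agent presumably does something fuzzier (judging translation quality rather than comparing strings), but a structural pass like this catches a surprising amount on its own.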
tmikaeld 20 hours ago:
I have the same experience. To make it worse, there's a mile of difference between the all-too-many model versions and variants on offer.