Remix.run Logo
hmottestad 10 days ago

Been playing with Codex CLI the past week and it really loves to create a fix for a bug by adding a special case for just that bug in the code. It couldn't see the patterns unless I pointed them out and asked it to create new abstractions.

It would just keep adding what it called "heuristics", which were just if statements that tested for a specific condition that arose during the bug. I could write 10 tests for a specific type of bug, and it would happily fix all of them. When I add another one test with the same kind of bug it obviously fails, because the fix that Codex came up with was a bunch of if statements that matched the first 10 tests.

xyzzy123 10 days ago | parent | next [-]

Also they hedge a lot, will try doing things one way, have a catch / error handler and then try a completely different way - only one of them can right but it just doesn't care. Have to lean hard to get it to check which paths are actually used and delete the others.

I am convinced this behaviour and the one you described are due to optimising for swe benchmarks that reward 1-shotting fixes without regard to quality. Writing code like this makes complete sense in that context.

mewpmewp2 10 days ago | parent [-]

That's a really good point. I was wondering why some of the LLMs were trained to try to pass things so sloppily constantly. Writing mock data, methods and pretending as if the task is complete and everything is great, good to go. They do seem to be trained just to pass some sort of conditions sadly and it feels somehow to me that it has got worse as of late. It should be relatively easy to reward them for writing robust code even if it takes longer or won't work, but it does seem they are geared towards getting high swe benchmarks.

Buttons840 10 days ago | parent | prev [-]

It's clear that these AIs are approaching human level intelligence. (:

Thank you for giving a perfect example of what I was describing.

The thing is, you actually can make the software work this way, you just have to add enough if-statements to handle all cases--or rather, enough cases that the manager is happy.