Remix.run Logo
xyzzy123 10 days ago

Also they hedge a lot, will try doing things one way, have a catch / error handler and then try a completely different way - only one of them can right but it just doesn't care. Have to lean hard to get it to check which paths are actually used and delete the others.

I am convinced this behaviour and the one you described are due to optimising for swe benchmarks that reward 1-shotting fixes without regard to quality. Writing code like this makes complete sense in that context.

mewpmewp2 10 days ago | parent [-]

That's a really good point. I was wondering why some of the LLMs were trained to try to pass things so sloppily constantly. Writing mock data, methods and pretending as if the task is complete and everything is great, good to go. They do seem to be trained just to pass some sort of conditions sadly and it feels somehow to me that it has got worse as of late. It should be relatively easy to reward them for writing robust code even if it takes longer or won't work, but it does seem they are geared towards getting high swe benchmarks.