dcre 2 days ago

Counterpoint: no, they're not. The test in the article is very silly.

vidarh a day ago | parent | next

This springs to mind:

"On two occasions I have been asked, – "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question"

It's valid to argue that there's a problem with training models to comply to the point where they won't speak up when asked to do something fundamentally broken, but at the same time a lot of people get very annoyed when the models refuse to do what they're asked.

There is an actual problem here, though, even if part of the problem is competing expectations of refusal.

But in this case, the test is also a demonstration of exactly how not to use coding assistants: Don't constrain them in ways that create impossible choices for them.

I'd guess (I haven't tested) that you'd have decent odds of getting better results from just pasting the error message into an agent than from adding stupid restrictions. And better still if you actually had a test case that verified valid output.
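To be concrete about the test case idea: something like the sketch below is what I have in mind. It's purely hypothetical, since I don't know what the article's actual code or data look like; parse_records, mymodule, and the fixture paths are all invented for illustration.

    # Hypothetical pytest sketch. The names are made up; the point is to
    # pin down what "valid output" means before asking an agent to
    # "fix the error".
    import pytest

    from mymodule import parse_records  # hypothetical function under test

    def test_returns_expected_fields_for_known_good_input():
        records = parse_records("fixtures/known_good_input.json")
        assert records, "expected at least one record"
        assert all("id" in r and "value" in r for r in records)

    def test_fails_loudly_when_required_data_is_missing():
        # If the input data is incomplete, we want a clear failure,
        # not silently fabricated output.
        with pytest.raises(KeyError):
            parse_records("fixtures/input_missing_value_field.json")

With something like that in place, "make these tests pass" is a much better instruction than "complete code only, without commentary".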

(and on a more general note, my experience is exactly the opposite of the writer's first two paragraphs)

InsideOutSanta 2 days ago | parent | prev | next

How is it silly?

I've observed the same behavior somewhat regularly, where the agent will produce code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure if it's getting worse over time, but it is at least plausible that smarter models get better at this type of "cheating".

A similar type of reward hacking is pretty commonly observed in other types of AI.

vidarh a day ago | parent | next

It's silly because the author asked the models to do something they themselves acknowledged isn't possible:

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.

But the problem with their expectation is that this is arguably not what they asked for.

So refusal would be failure. I tend to agree refusal would be better, but a lot of users get pissed off at refusals, and so the training tends to discourage that (some SFT/RLHF fine-tuning and feedback projects outright refuse to accept submissions from workers that include refusals).

And asking for "complete" code, without providing a test case showing what they expect such code to do, does not have to mean code that runs to completion without error. But again, in lots of other cases users expect exactly that, so here as well a lot of SFT/RLHF projects would reject responses that don't produce code that runs to completion in a case like this.

I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user who asks a broken question like that will then just paste in the same error with the same constraint. Possibly with an expletive added.
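By "a more specific error" I mean something along these lines. Again a hypothetical sketch, since the article doesn't show the underlying code, and the names are invented:

    # Hypothetical sketch: rather than inventing a value when data is
    # missing, fail with an error that points at the real problem.
    def get_required_field(record: dict, field: str):
        if field not in record:
            raise ValueError(
                f"record is missing required field {field!r}; "
                "the input data is incomplete, so no code change alone "
                "can produce a correct result"
            )
        return record[field]

At least that turns "fix the error" into an error message that tells the user what is actually wrong.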

So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests. As long as they keep doing well on more reasonable ones.

Zababa a day ago | parent | prev

It is silly because the problem isn't getting worse, and it isn't caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in the Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...), and they are working to reduce it and to measure it better. The assertions in the article seem to be mostly false and/or based on speculation, but it's impossible to really tell, since the author doesn't offer much detail (for example for the 10h task that used to take 5h and now takes 7-8h) beyond a very simple test (which reminds me more of "count the r in strawberry" than of coding performance, tbh).

amluto a day ago | parent | prev | next

Is it?

This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn’t actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don’t really think it found in its training set, as to why it wasn’t wrong.

I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure.

I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat.

This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.

efficax a day ago | parent

they all do this at some point. claude loves to delete tests that are failing if it can't fix them. or delete code that won't compile if it can't figure it out

amluto a day ago | parent

Huh. A while back I gave up fighting with Claude Code to get it to cheat the ridiculous Home Assistant pre-run integration checklist so I could run some under-development code and I ended up doing it myself.

terminalbraid a day ago | parent | prev | next

The strength of the argument you're making reminds me of an Onion headline.

https://theonion.com/this-war-will-destabilize-the-entire-mi...

"This War Will Destabilize The Entire Mideast Region And Set Off A Global Shockwave Of Anti-Americanism vs. No It Won’t"

dcre a day ago | parent

I was thinking of that when I wrote it.

foxglacier a day ago | parent | prev

Yes. He's asking it to do something impossible and then grading the responses, which must always be wrong, according to his own made-up metric. Somehow a program to help him debug it counts as a good answer despite him specifying that he wanted it to fix the error. So that's ignoring his instructions just as much as the answer that simply tells him what's wrong, but the "worst" answer actually followed his instructions and wrote completed code to fix the error.

I think he has two contradictory expectations of LLMs:

1) Take his instructions literally, no matter how ridiculous they are.

2) Be helpful and second-guess his intentions.

Leynos a day ago | parent

It's the following that is problematic: "I asked each of them to fix the error, specifying that I wanted completed code only, without commentary."

GPT-5 has been trained to adhere to instructions more strictly than GPT-4. It's a known issue that it will produce unreliable results when given nonsense or contradictory instructions.

A more realistic scenario would have been for him to request a plan or proposal for how the model might fix the problem.