InsideOutSanta, 2 days ago:
How is it silly? I've observed the same behavior somewhat regularly: the agent produces code that superficially satisfies the requirement, but does so in a way that is harmful. I'm not sure whether it's getting worse over time, but it's at least plausible that smarter models get better at this kind of "cheating". Similar reward hacking is pretty commonly observed in other types of AI.
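As a hypothetical illustration of the pattern (not from the article; the names and file format here are made up), here's code that "passes" by swallowing the failure instead of surfacing it:

    import json

    def parse_config(path):
        # Looks robust: it never raises. But a missing or corrupt file
        # silently becomes an empty config, which turns one obvious error
        # into a harder-to-debug failure somewhere downstream.
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            return {}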
vidarh, a day ago:
It's silly because the author asked the models to do something they themselves acknowledged isn't possible:

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.

But the problem with their expectation is that this is arguably not what they asked for, so refusal would be failure. I tend to agree refusal would be better, but a lot of users get pissed off at refusals, and so training tends to discourage them (some fine-tuning and feedback projects (SFT/RLHF) outright refuse to accept submissions from workers that include refusals).

And asking for "complete" code without providing a test case showing what they expect that code to do does not have to mean code that runs to completion without error. But in lots of other cases users expect exactly that, so for the same reason a lot of SFT/RLHF projects would reject responses whose code doesn't run to completion in a case like this.

I tend to agree that producing code that raises a more specific error would be better here too, but odds are a user who asks a broken question like that will just paste the same error back with the same constraint, possibly with an expletive added.

So I'm inclined to blame the users who make impossible requests more than I care about the model doing dumb things in response to dumb requests, as long as it keeps doing well on more reasonable ones.
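For contrast with the sketch above, a minimal sketch of the "raise a more specific error" behavior (hypothetical names; this assumes the task involves records missing an expected field):

    class MissingDataError(Exception):
        """Raised when the input lacks a field the computation needs."""

    def process(records):
        results = []
        for i, record in enumerate(records):
            if "timestamp" not in record:
                # Point at the actual gap instead of substituting a default,
                # so the user can fix the data rather than trust bad output.
                raise MissingDataError(
                    f"record {i} has no 'timestamp' field; "
                    "the problem is the source data, not the code"
                )
            results.append(record["timestamp"])
        return results

The error message itself does the debugging work the article's author said they wanted.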
Zababa, a day ago:
It is silly because the problem isn't getting worse, and it isn't caused by AI labs training on user outputs. Reward hacking is a known problem, as you can see in the Opus 4.5 system card (https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-...), and Anthropic is working both to reduce it and to measure it better. The assertions in the article seem to be mostly false and/or speculative, but it's impossible to really tell, since the author offers little detail (for example, on the 10h task that used to take 5h and now takes 7-8h) beyond a very simple test (one that reminds me more of "count the r's in strawberry" than of coding performance, tbh).