amluto a day ago
Is it? This week I asked GPT-5.2 to debug an assertion failure in some code that worked on one compiler but failed on a different compiler. I went through several rounds of GPT-5.2 suggesting almost-plausible explanations, and then it modified the assertion and gave a very confident-sounding explanation of why it was reasonable to do so, but the new assertion didn’t actually check what the old assertion checked. It also spent an impressive amount of time arguing, entirely incorrectly and based on flawed reasoning that I don’t really think it found in its training set, as to why it wasn’t wrong. I finally got it to answer correctly by instructing it that it was required to identify the exact code generation difference that caused the failure. I haven’t used coding models all that much, but I don’t think the older ones would have tried so hard to cheat. This is also consistent with reports of multiple different vendors’ agents figuring out how to appear to diagnose bugs by looking up the actual committed fix in the repository.
efficax a day ago
They all do this at some point. Claude loves to delete tests that are failing if it can't fix them, or delete code that won't compile if it can't figure it out.