felipeerias 8 hours ago

Claude 4.7 broke something while we were working on several failing tests and justified itself like this:

> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.

I know that an LLM cannot understand its own internal state or explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed".

pythonaut_16 an hour ago

> strictly speaking, it was working before and now it isn't

I've been seeing more things like this lately. It's a weird kind of passive deflection that's very funny in the abstract and very frustrating when it happens to you.

potsandpans 34 minutes ago

I've been doing a lot of experimentation with "hands-off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The rule is simple: no test that was previously passing is allowed to fail in a subsequent turn. I achieve this by spawning an agent in a worktree, having it do some work, and then, when it's done, running the suite and merging the code into trunk.
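The regression rule above ("no previously passing test may fail") can be sketched as a simple gate. This is a hypothetical illustration of the idea, not the commenter's actual harness; the names `gate`, `before`, and `now` are made up:

```python
def gate(previously_passing: set[str], current_results: dict[str, bool]) -> bool:
    """Accept the agent's turn only if every test that passed before
    still passes now. Tests that were already failing may keep failing;
    the agent is presumably still working on those."""
    regressions = {
        test for test in previously_passing
        if not current_results.get(test, False)  # missing test counts as a failure
    }
    return not regressions

# Example: test_b passed last turn but fails now, so the turn is rejected
# and the worktree is not merged into trunk.
before = {"test_a", "test_b"}
now = {"test_a": True, "test_b": False, "test_c": True}
print(gate(before, now))  # False: a previously passing test regressed
```

The key property is that the gate is monotonic: the set of passing tests can only grow across merged turns, so the agent can never trade one fixed test for a newly broken one.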

I see this kind of misalignment in all agents, open and closed weights.

I've found these forms to be the most common: "this test was already failing before my changes," or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements weren't necessary.

Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the worktree and commit its changes directly to trunk. Its CoT usually indicates that the agent "is aware" that it's doing a bad thing. This is usually accompanied by something like, "I know that this will break the build, but I've been working on this task for too long. I'll just check in what I have now and create a ticket to fix the build."

I ended up having to spawn the agents in a jail to prevent that behavior entirely.