| ▲ | ants_everywhere 3 days ago |
| I've been trying Claude Code for a few weeks after using Gemini CLI. The tool-use loop is a little better, which is nice. But Claude seems a little dumber and is aggressive about "getting things done", often ignoring common sense, explicit instructions, or design information. If I tell it to make a test pass, it will sometimes change my database structure to avoid having to debug the test. At least twice it deleted the protobufs from my project and replaced them with JSON because it struggled to immediately debug a proto issue. |
|
| ▲ | adregan 2 days ago | parent | next [-] |
| I’ve seen Claude Code get halfway through a small refactor (function parameters changed shape, or something like that), say something that looks like frustration at how long it’s taking, revert all of the good changes, and start writing a bash script to automate the whole process. In that case, you have to put a stop to it and point out that it would already be done if it hadn’t decided to blow it all up in an effort to write a one-time-use codemod. Of course it agrees with that point, as it agrees with everything. It’s the epitome of strong opinions loosely held. |
|
| ▲ | maronato 2 days ago | parent | prev | next [-] |
| Claude trying to cheat its way through tests has been my experience as well. Often it’ll delete or skip them and proudly claim all issues have been fixed. This behavior seems to be intrinsic to it since it happens with both Claude Code and Cursor. Interestingly, it’s the only LLM I’ve seen behave that way. Others simply acknowledge the failure and, after a few hints, eventually get everything working. Claude just hopes I won’t notice its tricks. It makes me wonder what else it might try to hide when misalignment has more serious consequences. |
|
| ▲ | animex 2 days ago | parent | prev | next [-] |
| I just had the same thing happen. Some comprehensive tests were failing, and it decided to write a simpler test rather than investigate why the more complicated ones were failing. I wonder if the team is trying to save compute by urging it to complete tasks more quickly! Claude seems to be under a compute crunch, as I often get API timeouts/errors. |
|
| ▲ | jonstewart 3 days ago | parent | prev | next [-] |
| The hilarious part I’ve found is that when it runs into the least bit of trouble with a step in one of its plans, it will say the step has been “Deferred” and then make up an excuse for why that’s acceptable. It is sometimes acceptable for humans to use judgment and defer work; the machine doesn’t have judgment, so it is not acceptable for it to do so. |

| ▲ | physix 2 days ago | parent | next [-] |
| Talking about hilarious, we had a Close Encounter of the Hallucinating Kind today. We were seeing mysterious simultaneous gRPC socket-closed exceptions on the client and server sides, running in Kubernetes and talking to each other through an nginx ingress. We captured debug logs and described the issue in detail to Gemini 2.5 Flash, giving it the nginx logs for the one second before and after an example incident, about 10k log entries. It came back with a clear verdict, saying "The smoking gun is here:" |
|     2025/07/24 21:39:51 [debug] 32#32: *5902095 rport:443 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.233.100.128, server: grpc-ai-test.not-relevant.org, request: POST /org.not-relevant.cloud.api.grpc.CloudEventsService/startStreaming HTTP/2.0, upstream: grpc://10.233.75.54:50051, host: grpc-ai-test.not-relevant.org |
| and gave me a detailed action plan. I was thinking this is cool, I don't need to use my head on this, until I realized that the log entry simply did not exist. It was entirely made up. (And yes, I admit I should know better than to do lousy prompting on a cheap foundation model.) |
| ▲ | quintu5 2 days ago | parent | prev | next [-] |
| My favorite is when you ask Claude to implement two requirements and it implements the first, gets confused by the second, removes the implementation of the first to “focus” on the second, and then finishes having implemented nothing. |
| ▲ | theshrike79 2 days ago | parent | next [-] |
| This is why you ask it to do one thing at a time. Then clear the context and move on to the next task. Context pollution is real and can hurt you. |
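| Concretely, a session might look something like this (an illustrative sketch: the prompts and file paths are made up, and /clear is Claude Code's built-in command for wiping the conversation history between tasks): |
|     > change foo() to take an options object instead of positional arguments |
|     (review the diff, run the tests, commit) |
|     > /clear |
|     > update the callers of foo() in src/api/ to pass the new options object |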
| ▲ | fragmede 2 days ago | parent | prev | next [-] |
| After the first time that happened, why would you continue to ask it to do two things at once? |
| ▲ | aaronbrethorst 2 days ago | parent | prev [-] |
| The implementation is now enterprise grade with robust security, :rocketship_emoji: |

| ▲ | ants_everywhere 2 days ago | parent | prev | next [-] |
| Oh yeah, totally. It feels a bit deceptive sometimes. Like just now, it said "great, the tests are consistently passing!" So I ran the same test command, and 4 of the 7 tests are so broken they don't even build. |

| ▲ | enobrev 2 days ago | parent [-] |
| I've noticed that in the "task complete" summaries I'll see something like "250/285 tests passing, but the broken tests are out of scope for this change". My immediate and obvious response is "you broke them!" (at least to myself), but I do appreciate that it's trying to stay focused, in some strange way. A simple "commit, fix failing tests" prompt will generally take care of it. I've been working on my "/implement" command to do a better job of checking that the full test suite is all green before it asks whether I want to clear the task and merge the feature branch. |
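| (A custom command like that is just a Markdown prompt file Claude Code picks up from .claude/commands/; the file name becomes the slash command, and $ARGUMENTS is replaced with whatever you type after it.) Roughly, the idea looks like the sketch below; the wording is simplified, and the npm test runner is just an assumption for illustration: |
|     # .claude/commands/implement.md |
|     Implement the task described in: $ARGUMENTS |
|     When you believe you are done, run the full test suite (npm test) |
|     and report the complete pass/fail summary. |
|     Never skip, delete, or stub out a failing test to get to green. |
|     Only when every test passes, ask whether to clear the task and |
|     merge the feature branch. |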

| ▲ | stkdump 2 days ago | parent | prev | next [-] |
| Well, I would say that the machine should not override the human input. But if the machine makes up the plans in the first place, why shouldn't it be allowed to change them? I think the hilarious part about it modifying tests to make them pass without understanding why they fail is that it probably picked that habit up from its human training data. |

| ▲ | mattigames 2 days ago | parent | prev [-] |
| "This task seems more appropriate for lesser beings, e.g. humans" |
|
| ▲ | Fade_Dance 2 days ago | parent | prev [-] |
| I even heard that it will aggressively delete your codebase and then lie about it. To your face. |