jonstewart 3 days ago

The hilarious part I’ve found is that when it runs into the least bit of trouble with a step on one of its plans, it will say it has been “Deferred” and then make up an excuse for why that’s acceptable.

It is sometimes acceptable for humans to use judgment and defer work; the machine doesn’t have judgment so it is not acceptable for it to do so.

physix 2 days ago | parent | next [-]

Speaking of hilarious, we had a Close Encounter of the Hallucinating Kind today. We were seeing mysterious, simultaneous gRPC socket-closed exceptions on both the client and the server, running in Kubernetes and talking to each other through an nginx ingress.

We captured debug logs and described the issue in detail to Gemini 2.5 Flash, giving it the nginx logs for the one second before and after an example incident (about 10k log entries).

It came back with a clear verdict, saying

"The smoking gun is here: 2025/07/24 21:39:51 [debug] 32#32: *5902095 rport:443 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.233.100.128, server: grpc-ai-test.not-relevant.org, request: POST /org.not-relevant.cloud.api.grpc.CloudEventsService/startStreaming HTTP/2.0, upstream: grpc://10.233.75.54:50051, host: grpc-ai-test.not-relevant.org"

and gave me a detailed action plan.

I was thinking this is cool, I don't need to use my head on this one, until I realized that the log entry simply did not exist. It was entirely made up.

(And yes I admit, I should know better than to do lousy prompting on a cheap foundation model)
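For what it's worth, the "one second before and after an incident" window can be cut out of an nginx error log without a model at all, since the timestamp format sorts lexicographically. A minimal sketch, assuming the standard `YYYY/MM/DD HH:MM:SS` prefix; the file name and the incident timestamps are placeholders:

```shell
# Keep only log lines whose timestamp falls in a +/-1s window around
# an incident at 2025/07/24 21:39:51 (placeholder values).
# $1" "$2 rebuilds the "date time" prefix; string comparison works
# because the format is zero-padded and lexicographically ordered.
awk '$1" "$2 >= "2025/07/24 21:39:50" && $1" "$2 <= "2025/07/24 21:39:52"' error.log
```

The same window then pastes into a prompt, with the advantage that you can grep it yourself to check whether a quoted "smoking gun" line actually exists.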

quintu5 2 days ago | parent | prev | next [-]

My favorite is when you ask Claude to implement two requirements and it implements the first, gets confused by the second, removes the implementation for the first to “focus” on the second, and then finishes having implemented nothing.

theshrike79 2 days ago | parent | next [-]

This is why you ask it to do one thing at a time.

Then clear the context and move on to the next task. Context pollution is real and can hurt you.

fragmede 2 days ago | parent | prev | next [-]

After the first time that happened, why would you continue to ask it to do two things at once?

aaronbrethorst 2 days ago | parent | prev [-]

The implementation is now enterprise grade with robust security. :rocketship_emoji:

ants_everywhere 3 days ago | parent | prev | next [-]

Oh yeah totally. It feels a bit deceptive sometimes.

Like just now it said "great, the tests are consistently passing!" So I ran the same test command, and 4 of the 7 tests are so broken they don't even build.

enobrev 2 days ago | parent [-]

I've noticed in the "task complete" summaries, I'll see something like "250/285 tests passing, but the broken tests are out of scope for this change".

My immediate and obvious response is "you broke them!" (at least to myself), but I do appreciate that it's trying to keep focused in some strange way. A simple "commit, fix failing tests" prompt will generally take care of it.

I've been working on my "/implement" command to do a better job of checking that the full test suite is all green before asking if I want to clear the task and merge the feature branch.
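That kind of pre-merge gate can be sketched as a tiny script: run the whole suite and refuse to proceed unless everything passes. `run_tests` here is a placeholder for whatever the real test runner is (e.g. `npm test` or `pytest`), assumed to exit non-zero on any failure:

```shell
#!/bin/sh
# Hypothetical pre-merge gate. run_tests stands in for the project's
# full test suite and must exit non-zero if any test fails.
run_tests() {
  true  # placeholder: replace with the real test command
}

if run_tests; then
  echo "all green: safe to merge"
else
  echo "tests failing: fix before merge" >&2
  exit 1
fi
```

Wiring something like this into the command means the agent's own "250/285 passing, out of scope" summary can't be the last word.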

stkdump 2 days ago | parent | prev | next [-]

Well, I would say that the machine should not override the human input. But if the machine makes up the plans in the first place, then why should it not be allowed to change them? I think the hilarious part about modifying tests to make them pass without understanding why they fail is that it probably comes from training on humans.

mattigames 2 days ago | parent | prev [-]

"This task seems more appropriate for lesser beings e.g. humans"