I have the same experience despite using claude every day. As an funny anecdote:

Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong, fine, it happens, but worse the 30 or so tests they added added 10 minutes to the test run time and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests

▲

monooso 20 hours ago | parent | next [-]

There was an article on HN last week (?) which described this exact behaviour in the newer models.

Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.

Hopefully someone with a larger context window than myself can recall the article in question.

	▲	SatvikBeri 20 hours ago \| parent [-]
		I think that article was basically wrong. They asked the agent not to provide any commentary, then gave an unsolvable task, and wanted the agent to state that the task was impossible. So they were basically testing which instructions the agent would refuse to follow. Purely anecdotally, I've found agents have gotten much better at asking clarifying questions, stating that two requirements are incompatible and asking which one to change, and so on. https://spectrum.ieee.org/ai-coding-degrades

▲

sReinwald 20 hours ago | parent | prev | next [-]

From my experience: TDD helps here - write (or have AI write) tests first, review them as the spec, then let it implement.

But when I use Claude code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests it better have a damn good reason or it gets the hose again.

The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!

	▲	edude03 19 hours ago \| parent [-]
		> TDD helps here - write (or have AI write) tests first, review them as the spec I agree, although I think the problem usually comes in writing the spec in the first place. If you can write detailed enough specs the agent will usually give you exactly what you asked for. If you're spec is vague, it's hard to eyeball if the tests or even the implementation of the tests matches what you're looking for.

▲

jermaustin1 20 hours ago | parent | prev | next [-]

This happens with me every time I try to get claude to write tests. I've given up on it. Instead I will write the tests if I really care enough to have tests.

▲

antonvs 20 hours ago | parent | prev [-]

> they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests

A very human solution

	▲	netsharc 12 hours ago \| parent [-]
		I wonder if Volkswagen would've blamed AI if they got caught with Dieselgate nowadays... In PR-lese: "To improve quality and reduce costs, we used AI to program some test code. Unfortunately the test code the AI generated fell below our standards, and it was missed during QA.". Then again they got their supplier Bosch to program the "defeat device" and lied to them that "Oh don't worry, it's just for testing, we won't deploy it to production". (The "device" (probably just an algorithm) detects whether the steering wheel was being moved or not as the throttle is pushed, and if not, it assumes the car was undergoing emissions testing, and it runs the engine in the environmentally friendlier mode).