I feel like it was doing what you're saying about 4-6 months ago. Especially the commenting out tests. Not always but I'd have to do more things step by step and keep the llm on track. Now though, the last 3-4 months, it's writing decent unit tests without much hand holding or refactors.

▲

joshstrange 2 hours ago | parent [-]

Hmm, my last experience was within the last 2 months but I'm trying not to write it off as "this sucked and will always suck", that's the #1 reason I keep testing and playing with these things, the capabilities are increasing quickly and what did/didn't work last week (especially "last model") might work this week.

I'll keep testing it but that just hasn't been my experience, I sincerely hope that changes because an agent that runs unit test [0] and can write them would be very powerful.

[0] This is a pain point for me. The number of times I've watching Claude run "git commit --no-verify"... I've told it in CLAUDE.md to never bypass commit checks, I've told it in the prompt, I've added it 10 more times in different places in CLAUDE.md but still, the agent will always reach for that if it can't fix something in 1-3 iterations. And yes, I've told it "If you can't get the checks to pass then ask me before bypassing the checks".

It doesn't matter how many guardrails I put up and how good they are if the agent will lazily bypass them at the drop of a hat. I'm not sure how other people are dealing with this (maybe with agents managing agents and checking their work? A la Gas Town?).

▲

kaydub 2 hours ago | parent [-]

I haven't seen your issue, but git is actually one of the things I don't have the llm do.

When I work on issues I create a new branch off of master, let the llm go to town on it, then I manually commit and push to remote for an MR/PR. If there are any errors on the commit hooks I just feed the errors back into the agent.

▲

joshstrange 2 hours ago | parent [-]

Interesting, ok, I might try that on my next attempt. I was trying to have it commit so that I could use pre-commit hooks to enforce things I want (test, lint, prettier, etc) but maybe instead I should handle that myself and make it more explicit in my prompts/CLAUDE.md to test/lint/etc. In reality I should just create a `/prep` command or similar that asks it to do all of that so that once it thinks it's done, I can quickly type that and have it get everything passing/fixed and then give a final report on what it did.

▲

kaydub 18 minutes ago | parent | next [-]

In my projects there's generally a "developer" way to do things and an "llm agent" way to do things.

For the llm a lot of linting and build/test tools go into simple scripts that the llm can run and get shorthand info out of. Some tools, if you have the llm run them, they're going to ingest a lot from the output (like a big stacktrace or something). I want to keep context clean so I have the llm create the tool to use for build/test/linting and I tell it to create it so the outputs will keep its context clean, then I have it document it in the .md file.

When working with the LLM I have to start out pretty explicit about using the tooling. As we work through things it will start to automatically run the tooling. Sometimes it will want to do something else, I just nudge it back to use the tooling (or I'll ask it why or if there are benefits to the other way and if there are we'll rebuild the tooling to use the other way).

Finally, if the LLM is really having trouble, I kill the session and start a new one. It used to feel bad to do that. I'd feel like I'm losing a lot of info that's in context. But now, I feel like it's not so bad... but I'm not sure if that's because the llms are better or if my workflow has adapted.

Now, let me backup a little bit. I mentioned that I don't have the llm use git. That's the control I maintain. And with that my workflow is: llm builds feature->llm runs linters/tests->I e2e test whatever I'm building by deploying to a dev/staging/local env->once verified I commit. Now I will continue that context window/session until I feel like the llm starts fucking up. Then I kill the session and start a new one. I rarely compact, but it does happen and I generally don't fret about it too much.

I try to keep my units of work small and I feel like it does the best when I do. But then I often find myself surprised at how much it can do from a single prompt, so idk. I do understand some of the skepticism because a lot of this stuff sounds "hand-wavy". I'm hoping we all start to hone in on some general more concrete patterns but with it being so non-deterministic I'm not sure if we will. It feels like everyone is using it differently and people are having successes and failures across different things. People where I work LOVE MCPs but I can't stand them. When I use them it always feels like I have to remind the llm that it has an MCP, then it feels like the MCP takes too much context window and sometimes the llm still trips over how to use it.

	▲	joshstrange 9 minutes ago \| parent [-]
		Ok, that's a good tip about separate tools/scripts for the LLM, I did something similar less than a year ago so that I kept lint/test output to a minimum but it was still invoked via git hooks. I'll try again with scripts next time I'm doing this. My hope was to let the agent commit to a branch (with code that passed lint/test/prettier/etc), push it, auto-deploys to preview branches, and then that's where I'd do my e2e/QA and once I was happy I could merge it and it get deployed to the main site.

▲

toraway an hour ago | parent | prev [-]

You’ll likely have the same issue relying on CLAUDE.md instructions to test/lint/etc, mine get ignored constantly to the point of uselessness.

I’m trying to redesign my setup to use hooks now instead because poor adherence to rules files across all the agentic CLIs is exhausting to workaround.

(and no, Opus 4.5 didn’t magically solve this problem to preemptively respond to that reply)

	▲	kaydub 14 minutes ago \| parent [-]
		What do your rules files look like? I wonder if some people are putting in too much into their markdown files of what NOT to do. I hate people saying the llms are just better auto-correct, but in some ways they're right. I think putting in too much "don't do this" is leading the llm down the path to do "this" because you mentioned it at all. The LLM is probabilistically generating it's response based on what you've said and what's in the markdown files, the fact you put some of that stuff in there at all probably increases the probability those things will show up.