vexnull 3 hours ago
This matches what I've seen in practice. The failure mode isn't wrong code - it's code that solves the problem in a way no human would choose. Unnecessary abstractions, ignoring existing patterns in the codebase, fixing the symptom instead of the root cause. Tests pass but the PR makes the codebase worse. SWE-bench measures "does the patch work" but the actual bar for merging is "does this look like something a team member wrote who understands the project."
tripdout 2 hours ago
Is this, along with the comments by the other green usernames on this post, an AI-generated comment? Apologies if it isn't - AIs are trained on human writing and all that - but they're jumping out at me.

Edit: I see another green comment was flagged for AI, which might be indicative of something, but why so many green comments on this thread specifically?
bckygldstn an hour ago
* Dashes
* Triplets
* "X isn't Y, it's Z"
* "X but Y"
* Wording that looks good at first pass, but when you read closely actually makes no sense in the context of the discussion: "fixing the symptom instead of the root cause"

Flagged.
bandrami 3 hours ago
I'm very much an AI bear, but I do think one interesting outcome is going to be that LLMs will stumble upon some weird ways of doing things that no human would have chosen, but that turn out to be better (Duff's device-level stuff), and those will end up entering the human programming lexicon.
omcnoe an hour ago
These are the same kinds of issues often seen with human junior engineer work.
ukuina 2 hours ago
Lints, beautifiers, better tests?
colechristensen 2 hours ago
Eh, but if you're in an organization you tune your AGENTS.md, CLAUDE.md, AI code reviews, etc. so that your human-driven or automated AI-generated code fits the standards of your organization. I don't need models to be smart enough to aggressively try to divine what the organization wants them to do; the users will make that happen. So this post is maybe a little bit over the top. I am literally right now tuning my Claude instructions and PR instructions to match our standards.

Funny enough, I'm having the opposite problem: Claude is lowering its rating of my PR because my testing, documentation, and error handling are better than the other code in the repository, so my code doesn't match and therefore gets a worse grade. I don't need it to try any harder without explicit instructions.
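For readers unfamiliar with this kind of tuning: the idea is that files like AGENTS.md or CLAUDE.md are plain instruction files the coding agent reads before working in a repository. A minimal hypothetical sketch (the section names and rules here are illustrative, not taken from any real project or from this thread) might look like:

```markdown
# AGENTS.md (illustrative example only)

## Code conventions
- Follow the existing patterns in this repository; do not introduce a new
  abstraction if an existing one can express the change.
- Fix root causes, not symptoms: if a bug traces back to a shared helper,
  patch the helper rather than special-casing the call site.

## Review rubric
- Grade PRs against the standards in CONTRIBUTING.md, not against the
  average quality of the surrounding code in the repository.
```

The last rule targets exactly the failure colechristensen describes: without it, a reviewer model that anchors on the surrounding code will penalize a PR for being better than its neighbors.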