> Agents are about as good as a random choice in picking the right answer, and there's typically only one right answer.

That's realistically because they aren't even trying to answer that question by thinking sensibly about the code. Because they work within a limited context in everything they do, they end up guessing and trying the first thing that might work. That's why they generally do a bit better when you explicitly ask them to reverse engineer or document the design of an existing codebase: that problem at least carries an explicit requirement to survey the code comprehensively, figure out which parts matter, etc. They can't be expected to do that by default. It's not even a limitation of existing models; it's inherent to how they're architected.

pron 2 days ago

Yes, and I think there's a fundamental problem here. The big reason "AI thought leadership" claims that AI should do well at coding is that there are mechanical success metrics like tests. Except that's not true. The tests cover the behaviour, not the structure. It's like constructing a building where the only tests are whether floorplans match the design. It makes catastrophic structural issues easy to hide. The building looks right, and it might even withstand some load, but later, when you want to make changes, you move a cupboard or a curtain rod only to have the structure collapse because that element ended up being load-bearing.
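To make the "tests cover the behaviour, not the structure" point concrete, here is a minimal Python sketch. Everything in it (`clean_total`, `FragileCart`, `tax_label`) is a hypothetical name invented for illustration, not code from any real project. Both versions pass the same behavioural check, but the second one only works because a method that looks purely cosmetic quietly sets the state the total depends on.

```python
def clean_total(prices, tax_rate):
    """Explicit structure: everything the result depends on is a parameter."""
    return round(sum(prices) * (1 + tax_rate), 2)


class FragileCart:
    """Hypothetical cart whose total depends on a hidden side effect."""

    def __init__(self):
        self._rate = 0.0  # only ever set inside tax_label()

    def tax_label(self, rate):
        # Looks like a display helper (the "curtain rod"), but it also
        # stores the rate that total() silently relies on.
        self._rate = rate
        return f"Tax: {rate:.0%}"

    def total(self, prices):
        return round(sum(prices) * (1 + self._rate), 2)


# The same behavioural test passes for both designs.
cart = FragileCart()
cart.tax_label(0.2)
assert clean_total([40, 60], 0.2) == 120.0
assert cart.total([40, 60]) == 120.0

# A later "harmless" change: the label is no longer rendered, so
# tax_label() is never called, and the total is now silently wrong.
changed_cart = FragileCart()
untaxed = changed_cart.total([40, 60])  # tax is missing from this total
```

A test suite that only checks the final number cannot tell these two designs apart until someone removes the label call.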

It's funny, but one of the lessons I've learnt working with agents is just how much design matters in software and isn't just a matter of craftsmanship pride. When you see the codebase implode after the tenth new feature and realise it has to be scrapped because neither human nor AI can salvage it, the importance of design becomes palpable. Before agents it was hard to see because few people write code like that (just as no one would think to make a curtain rod load-bearing when building a structure).

And let's not forget that the models hallucinate. Just now I was discussing architecture with Codex, and what it says sounds plausible, but it's wrong in subtle and important ways.

zozbot234 2 days ago

> The big reason "AI thought leadership" claims that AI should do well at coding is that there are mechanical success metrics like tests.

I mean, if you properly define "do well" as getting a first draft of something interesting that might or might not be a step towards a solution, that's not completely wrong. A pass/fail test is verified feedback of a sort, which the AI can then iterate on quickly. It's just very wrong to expect that you can get away with only checking for passing tests and not even loosely surveying what the AI generated (which is invariably what people do when they submit a bunch of vibe-coded pull requests that are 10k lines each or more, and call that a "gain" in productivity).

pron 2 days ago

It's not completely wrong if you're interested in a throwaway codebase. It is completely wrong if what you want is a codebase you'll evolve over years. Agents are nowhere close to offering that (yet) unless a human is watching them like a hawk (closer than you'd watch another human programmer, because human programmers don't make such dangerous mistakes as frequently, and when they do make them, they don't hide them as well).