pron 10 hours ago

I generally fall into the first camp, too, but the code AI produces is problematic because it will stop working in an unrecoverable way after some number of changes. That's what happened in the Anthropic C compiler experiment (they ended up with a codebase that wasn't working and couldn't be fixed), and it's what I see Codex doing in my own codebase about once every 3-5 changes. I think that if I had let that code in, the project would have been destroyed within another 10 or so changes, in the sense that it would have become impossible to fix a bug without creating another. We're not talking about style or elegance here. We're talking about ticking time bombs.

I think that the two real camps here are those who haven't carefully - and I mean really carefully - reviewed the code the agents write and haven't put their process under a real stress test, vs. those who have. Obviously, people who don't look for the time bombs naturally think everything is fine. That's how time bombs work.

I can make this more concrete. The program wants to depend on some invariant, say that a particular list is always sorted, and the code maintains it by always inserting elements in the right place in the list. Other code that needs to search for an element depends on that invariant. Then it turns out that under some conditions - due to concurrency, say - an element is inserted in the wrong place and the list isn't sorted, so one of the places that tries to find an element in the list fails to find it. At that point, it's a coin toss whether the agent will fix the insertion or the search. If it fixes the search, the bug is still there for all the other consumers of the list, but the testing didn't catch that. Then what happens is that, with further changes, depending on their scope, you find that some new code depends on the intended invariant and some doesn't. After several such splits and several failed invariants, the program ends up in a state where nothing can be done to fix a bug without creating another. If the project is "done" before that happens - you're in luck; if not, you're in deep, deep trouble. But right up until that point, unless you very carefully review the code (because the agents are really good at making code seem reasonable under cursory scrutiny), you think everything is fine. Unless you go looking for cracks, every building seems stable until some catastrophic failure, and AI-generated code is full of cracks that are just waiting for the right weight distribution to break open and bring the whole thing down.
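To put the same scenario in code - a minimal, made-up sketch (the names are invented for illustration): the insert path maintains the sortedness invariant, the lookup path silently depends on it, and nothing fails loudly if the invariant breaks:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class EventLog {
        // Invariant: `events` is always sorted in ascending order.
        private final List<Long> events = new ArrayList<>();

        // Maintains the invariant by inserting at the position binarySearch reports.
        void add(long timestamp) {
            int i = Collections.binarySearch(events, timestamp);
            events.add(i >= 0 ? i : -i - 1, timestamp);
        }

        // Relies on the invariant: binary search is only correct on a sorted list.
        boolean contains(long timestamp) {
            return Collections.binarySearch(events, timestamp) >= 0;
        }
    }

If unsynchronized concurrent calls to add() ever leave `events` unsorted, contains() starts missing elements. "Fixing" contains() with a linear scan makes that one symptom go away while the broken invariant stays in place for every other consumer of the list - which is exactly the coin toss I'm describing.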

So it sounds to me that the people you think are in the first camp not only don't care how the building is built as long as it doesn't collapse, but also believe that if it hasn't collapsed yet it must be stable. The first part is, indeed, a matter of perspective, but the second part is just wrong (not just in principle but also when you actually see the AI's full-of-cracks code).

zozbot234 8 hours ago | parent | next [-]

> The program wants to depend on some invariant, say that a particular list is always sorted, and the code maintains it by always inserting elements in the right place in the list.

Invariants must be documented as part of defining the data or program module, and ideally they should be restated at any place they're being relied upon. If you fail to do so, that's a major failure of modularity and it's completely foreseeable that you'll have trouble evolving that code.
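Something like this (a hypothetical sketch, names made up):

    import java.util.Collections;
    import java.util.List;

    final class SortedIds {
        /** Invariant: `ids` is sorted ascending at all times; every mutator must preserve this. */
        private final List<Integer> ids;

        SortedIds(List<Integer> ids) {
            assert isSorted(ids) : "SortedIds invariant violated: ids must be ascending";
            this.ids = ids;
        }

        /** Relies on the sortedness invariant (restated here, at the point of use). */
        int indexOf(int id) {
            assert isSorted(ids) : "SortedIds invariant violated before lookup";
            return Collections.binarySearch(ids, id);
        }

        private static boolean isSorted(List<Integer> xs) {
            for (int i = 1; i < xs.size(); i++) {
                if (xs.get(i - 1) > xs.get(i)) return false;
            }
            return true;
        }
    }

The point isn't the asserts themselves; it's that the invariant is stated where the data is defined and restated where it's consumed, so anyone (or anything) editing either side can see what they're not allowed to break.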

pron 7 hours ago | parent [-]

Right, except even when the invariants are documented, agents get into trouble. Virtually every week I see the agent write strange code with multiple paths: it knows that the invariant _should_ hold, but it still writes a workaround for the cases where it doesn't. Something I see even more frequently is that the agent knows a certain exception shouldn't occur, but it does, so half the time it will choose to investigate and half the time it says, oh well, and catches the exception. In fact, it's worse: sometimes it proactively catches exceptions that shouldn't occur as part of its "success at all costs" drive, and all these contingency plans it builds into the code make it very hard (even for the agent) to figure out why things go wrong.
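A made-up example of the kind of contingency I mean (the config names are invented):

    import java.util.Map;

    class ServerConfig {
        // The "workaround" pattern: "port" is documented to always be present,
        // so a missing value means a bug elsewhere - but the code papers over it.
        static int portWithWorkaround(Map<String, String> config) {
            try {
                return Integer.parseInt(config.get("port"));
            } catch (RuntimeException e) {
                return 8080; // swallow the violated invariant and carry on
            }
        }

        // What I'd rather see: fail loudly so the real cause gets investigated
        // instead of masked by a fallback path nobody can later explain.
        static int portOrFail(Map<String, String> config) {
            String raw = config.get("port");
            if (raw == null) {
                throw new IllegalStateException("config invariant violated: 'port' is missing");
            }
            return Integer.parseInt(raw);
        }
    }

Multiply that pattern across a codebase and you get the situation I described: every layer has its own silent fallback, so when something finally does go wrong, the failure shows up far from its cause.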

Most importantly, this isn't hypothetical. We see that agents write programs that, after some number of changes, just collapse because they don't converge. They don't transition well between layers of abstraction, so they build contingencies into multiple layers, and the result is that after some time the codebase is just broken beyond repair and no changes can be made without breaking something (and because of all the contingencies, reproducing the breakage can be hard). This is why agents don't succeed in building even something as simple as a workable C compiler, even with a full spec and thousands of human-written tests.

If the agents could code well, no one would be complaining. People complain because agent code becomes structurally unsound over time, and then it's only a matter of time until it collapses. Every fix and change you make without super careful supervision has a high chance of weakening the structure.

zozbot234 7 hours ago | parent [-]

Agents don't really know the whole codebase when they're writing the code; their context is way too small for that, and trying to grow the context window doesn't really work well (most of it gets ignored). So they're always working piecemeal, and these failures are entirely expected unless the codebase is rigorously built for modularity and the agent is told to work "in the small" and keep to the existing constraints.

pron 5 hours ago | parent [-]

> Agents don't really know the whole codebase when they're writing the code

Neither do people, yet people manage to write software that they can evolve over a long time, and agents have yet to do that. I think it's because people can move back and forth between levels of abstraction, and they know when it's best to do it, but agents seem to have a really hard time doing that.

On the other hand, agents are very good at debugging complex bugs that span many parts of the codebase, and they manage to do that even with their context limitations. So I don't think it's about context. They're just not smart enough to write stable code yet.

skydhash 9 hours ago | parent | prev [-]

It can be especially bad if the architecture is layered, with each layer having its own invariants. In a music player, for example, you may have the concept of a queue in the domain layer, but in the UI layer you may have additional constraints that don't relate to that. Then the agent decides to fix a bug in the UI layer because the description is of a UI bug, while it's in fact a queue bug.
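Something along these lines (hypothetical sketch):

    import java.util.ArrayList;
    import java.util.List;

    class PlayQueue {                        // domain layer
        // Invariant: currentIndex points into tracks, or is -1 when the queue is empty.
        private final List<String> tracks = new ArrayList<>();
        private int currentIndex = -1;

        void enqueue(String track) {
            tracks.add(track);
            if (currentIndex < 0) currentIndex = 0;
        }

        String current() {
            return currentIndex >= 0 ? tracks.get(currentIndex) : null;
        }
    }

    class NowPlayingView {                   // UI layer, with its own constraints
        private final PlayQueue queue;

        NowPlayingView(PlayQueue queue) { this.queue = queue; }

        String title() {
            String t = queue.current();
            // If the queue's invariant breaks mid-playback, the symptom appears here.
            // A "fix" that just renders a placeholder hides the real bug in the domain layer.
            return t != null ? t : "(nothing playing)";
        }
    }

The bug report says "the now-playing title is blank", so the agent patches NowPlayingView, and the broken queue invariant lives on underneath.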

mikkupikku 9 hours ago | parent [-]

Shit like this is why you really have to read the plans instead of blindly accepting them. The bots are naturally lazy and will take short cuts whenever they think you won't notice.