pron 8 hours ago
My problem with the code the agents produce has nothing to do with style or art. The clearest example of how bad it is came from Anthropic's experiments in which agents failed to write a C compiler. That is not a very hard programming job to begin with if you know compilers, as the models do, yet they failed even with a practically unrealistic level of assistance: a complete spec, thousands of human-written tests, and a reference implementation used as an oracle, not to mention that the models were trained on both the spec and the reference implementation.

If you look at the evolution of agent-written code, you see that it may start out fine, but as you add more and more features, things go horribly wrong. Say the model runs into a wall. Sometimes the right thing to do is to go back into the architecture and put a door in that spot; other times the right thing is to ask why you hit that wall in the first place, because maybe you've taken a wrong turn. The models seem to pick one or the other almost at random, and sometimes they just blast a hole through the wall. After enough features it's clear there's no convergence, just as happened in Anthropic's experiment: the agents ultimately can't fix one problem without breaking something else.

You can also see how they shoot themselves in the foot by adding layers upon layers of defensive coding that get so thick they themselves can't think through them. I once asked an agent to write a data structure that maintains an invariant in subroutine A and uses it in subroutine B. It wrote A fine, but B ignored the invariant and did a brute-force search over the data, the very thing the data structure was meant to avoid. As it was writing it, the agent explained that it didn't want to trust the invariant established in A because it might be buggy... Another thing you frequently see is that the code they write is so intent on success that it has a plan A, plan B, and plan C for everything.
It tries to do something one way and adds contingencies for failure, and so the code and the complexity compound until nothing and no one can save you. If you're lucky, your program is "finished" before that happens.

My experience is mostly with gpt5.4 and 5.3-codex, although Anthropic's failed experiment shows that the Claude models suffer from similar problems. What does it say when a compiler expert who knows multiple compilers pretty much by heart, with access to thousands of tests, can't even write a C compiler? Most important software is more complex than a C compiler, isn't as well specified, and the models haven't trained on it. I wish they could write working code; they just don't.[1] But man, can they debug (mostly because they're tenacious and tireless).

[1]: By which I don't mean they never do, but you can't trust them to do it the way you can trust a programmer. Knowing how to code, like knowing how to fly a plane, doesn't mean sometimes getting the right result. It means always getting the right result (within your capabilities, which, for humans, are usually known in advance).
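To make the invariant anecdote concrete, here's a minimal hypothetical sketch of the kind of thing I mean (the names and the sorted-order invariant are my own illustration, not the actual code from that session): subroutine A keeps the data sorted, so subroutine B is entitled to binary-search it.

```python
import bisect

class SortedBag:
    def __init__(self):
        self._items = []

    def insert(self, x):
        # Subroutine A: maintain the sorted-order invariant on every insert.
        bisect.insort(self._items, x)

    def contains(self, x):
        # Subroutine B: trust the invariant and do an O(log n) binary search.
        # The agent instead wrote the equivalent of `x in self._items` here,
        # a linear scan that defeats the point of keeping the data sorted,
        # because it "didn't want to trust" that A really kept things sorted.
        i = bisect.bisect_left(self._items, x)
        return i < len(self._items) and self._items[i] == x

bag = SortedBag()
for v in (5, 1, 3):
    bag.insert(v)
print(bag.contains(3), bag.contains(4))  # True False
```

The whole value of the structure is that B may assume what A guarantees; defensively re-checking (or ignoring) the invariant in B is exactly the compounding caution I'm describing.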
simianwords 5 hours ago
The thing is, for most places the kind of code they write is good enough. You have painted an awfully pessimistic picture that frankly does not mirror the reality of many enterprises.

> What does it say when a compiler expert that knows multiple compilers pretty much by heart, with access to thousands of tests, can't even write a C compiler?

It does not know compilers by heart. That's just not true. The point of the experiment was to see how big a codebase it can handle without human intervention, and now we know the limits. The limitation has always been context size.

> By which I don't mean they never do, but you really can't trust them to do it as you can a programmer. Knowing to code, like knowing to fly a plane, doesn't mean sometimes getting the right result. It means always getting the right result (within your capabilities that are usually known in advance in the case of humans).

Getting things right ~90% of the time still saves me a lot of time. In fact, I would assume this is how autopilot works too: it does 90% of the job, and the pilot is required to supervise it.