rudedogg 4 hours ago

Hearing people on tech twitter say that LLMs always produce better code than they do by hand was pretty enlightening for me.

LLMs can produce better code for languages and domains I’m not proficient in, at a much faster rate, but damn it’s rare I look at LLM output and don’t spot something I’d do measurably better.

These things are average text generation machines. Yes, you can improve the output quality by writing a good prompt that activates the right weights. But if you’re seeing output that is consistently better than what you produce by hand, you’re probably just below average at programming. And yes, it matters sometimes. Look at the number of software bugs we’re all subjected to.

And let’s not forget that code is a liability. Utilizing code that was “cheap” to generate has a cost, which I’m sure will be the subject of much conversation in the near future.

kokanee 2 hours ago | parent | next [-]

> These things are average text generation machines.

Funny... seems like about half of devs think AI writes good code, and half think it doesn't. When you consider that it is designed to replicate average output, that makes a lot of sense.

So, as insulting as OP's idea is, it would make sense that below-average devs are getting gains by using AI, and above-average devs aren't. In theory, this situation should raise the average output quality, but only if the training corpus isn't poisoned with AI output.

I have an anecdote that doesn't mean much on its own, but supports OP's thesis: there are two former coworkers in my LinkedIn feed who are heavy AI evangelists and have drifted over the years from software engineering into senior business development roles at AI startups. Both of them are unquestionably in the top 5 worst coders I have ever worked with in 15 years; one of them was fired for code quality and testing practices. Their coding ability, their transition to less technical roles, and their extremely vocal support for the power of vibe coding all align with OP's uncharitable character evaluation.

wild_egg 3 hours ago | parent | prev | next [-]

After a certain experience level though, I think most of us get to the point of knowing when that difference in quality actually matters.

Some seniors love to bikeshed PRs all day because they can do it better, but generally that activity has zero actual value. Sometimes it matters; often it doesn't.

Stop with the "I could do this better by hand" and ask "is it worth the extra 4 hours to do this by hand, or is this actually good enough to meet the goals?"

throwawayffffas 3 hours ago | parent | next [-]

LLM-generated code is technical debt. If you are still working on the codebase the next day, it will bite you. It might be as simple as an inconvenient interface or a bunch of duplicated functions that could just have been imported, but eventually you are going to have to pay it down.

visarga 2 hours ago | parent | next [-]

Untested, undocumented LLM code is technical debt, but if you do specs and tests it's actually the opposite: you can go beyond technical debt and regenerate your code as you like. You just need testing to be so good it guarantees the behavior you care about, and that is easier in our age of AI coding agents.
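
To make that concrete: a minimal sketch (Python/pytest; the function and its tests are made up for illustration) of tests that pin the behavior so the implementation underneath can be regenerated at will:

  import re

  def slugify(text):
      # Placeholder implementation: the part an agent is free to
      # regenerate, as long as the tests below keep passing.
      return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

  def test_slugify_basic():
      assert slugify("Hello, World!") == "hello-world"

  def test_slugify_is_idempotent():
      once = slugify("deja vu -- again")
      assert slugify(once) == once

  def test_slugify_is_url_safe():
      assert all(c.isalnum() or c == "-" for c in slugify("a b/c?d"))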

mkroman 2 hours ago | parent [-]

> but if you do specs and tests it's actually the opposite, you can go beyond technical debt and regenerate your code as you like.

Having to write all the specs and tests just right so you can regenerate the code until you get the desired output just sounds like an expensive version of the infinite monkey theorem, but with LLMs instead of monkeys.

fourthark 2 hours ago | parent [-]

You can have it write the specs and tests, too, and review and refine them much faster than you could write them.

ymyms 3 hours ago | parent | prev | next [-]

All code is technical debt though. We can't spend infinite hours finding the absolute minimum of technical debt introduced for a change, so it is just a matter of finding the right balance. That balance is highly dependent on a huge number of factors: how core the system is, what the system is used for, what stage of development the system is in, etc.

2muchcoffeeman 2 hours ago | parent | prev | next [-]

Are people not reviewing and refactoring LLM code?

bdangubic 3 hours ago | parent | prev [-]

In your comment, replace “LLM” with “Human SWE” and the statement will still be correct in the vast majority of situations :)

throwawayffffas 3 hours ago | parent [-]

That's legit true. All code is technical debt. Human SWEs have one saving grace. Sometimes they refactor and reduce some of the debt.

ymyms 3 hours ago | parent [-]

A human SWE can use an LLM to refactor and reduce some of the debt just as easily too. I think fundamentally, the possible rate of new code and new technical debt introduced by LLMs is much higher than that of a human SWE. Even left unchecked, a human still needs sleep, and more humans can't be added just by adding compute.

There's an interesting aspect to the LLM debt being taken on now, though: I'm sure some are taking it on in the hope that further advances in LLMs will make it easier to address before it becomes a real problem.

Arch-TK 3 hours ago | parent | prev | next [-]

"actually good enough to meet the goals?"

There's "okay for now" and then there's "this is so crap that if we set our bar this low we'll be knee deep in tech debt in a month".

A lot of LLM output in the specific areas _I_ work in is firmly in that latter category, and often it just doesn't work.

gbnwl 3 hours ago | parent | next [-]

So I can tell you don't use these tools much, if at all, because at the speed of development with them you'll be knee deep in tech debt in a day, not a month. But as a corollary, you can have the same agentic coding tools do the equivalent of weeks of addressing tech debt the next day. Well, I think this applies to greenfield, AI-first projects that work this way from the get-go with few humans in the loop (human-to-human communication definitely becomes the rate-limiting step). But I imagine that's not the nature of your work.

daveguy 10 minutes ago | parent [-]

I think you missed your parent post's phrase "in the specific areas _I_ work in" ... LLMs are a lot better at CRUD and boilerplate than at novel hardware interfaces and a bunch of other domains.

ambicapter 2 hours ago | parent | prev [-]

I mean, there's also: "this looks fine, but if I had actually written this code I would've naturally spent more time on it, which would have led me to anticipate the future of this code just a little bit more, and I will only feel that awkwardness when I come back to it in two weeks, and then we'll do it all over again." It's a spectrum.

hu3 3 hours ago | parent | prev | next [-]

Perhaps writing code by hand will be considered a micro-optimisation in the future.

Just like writing assembly is today.

rtpg 2 hours ago | parent | prev [-]

Now, sometimes that's 4 hours, but I've had plenty of times where I'm "racing" people using LLMs and I basically get the coding done before them. Once I debugged an issue before the robot was done `ls`-ing the codebase!

The shape of the problem is super important in considering the results here.

christophilus 3 hours ago | parent | prev | next [-]

Claude Code is way better than I am at rummaging through Git history, handling merge conflicts, renaming things, writing SQL queries whose syntax I always forget (window functions and the like). But yeah. If I give it a big, non-specific task, it generates a lot of mediocre (or worse) code.
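
For instance, the kind of window-function query I mean (a quick sketch using Python's built-in sqlite3 so it's self-contained; the table is made up):

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
      CREATE TABLE sales (region TEXT, amount INTEGER);
      INSERT INTO sales VALUES ('east', 10), ('east', 30), ('west', 20);
  """)
  # Window functions need SQLite >= 3.25 (bundled with modern Python).
  rows = con.execute("""
      SELECT region, amount,
             SUM(amount) OVER (PARTITION BY region ORDER BY amount
                               ROWS UNBOUNDED PRECEDING) AS running_total,
             RANK() OVER (ORDER BY amount DESC) AS overall_rank
      FROM sales
  """).fetchall()
  for row in rows:
      print(row)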

throwawayffffas 3 hours ago | parent [-]

That's funny, that's all the things I don't trust it to do. I actually use it the other way around: give it a big non-specific task, see if it works, specify better, retry, throw away 60%-90% of the generated code, fix bugs in a bunch of places, and out comes an implemented feature.

bryanlarsen 2 hours ago | parent [-]

Agreed. Claude is horrible at munging git history and can destroy the thing I depend on to fix Claude's messes. I always do my git rebasing by hand.

The first iteration of Claude's code is usually a big, over-coded mess, but it's pretty good at iterating to clean it up, given proper instruction.

rectang 2 hours ago | parent [-]

I give the agent the following standing instructions:

"Make the smallest possible change. Do not refactor existing code unless I explicitly ask."

That directive cut down considerably on the number of extra changes I had to review. When it gets it right, the changes are close to the right size now.

The agent still tries to do too much, typically suggesting three tangents for every interaction.

godzillabrennus 3 hours ago | parent | prev | next [-]

As someone who: 1.) Has a brain that is not wired to think like a computer and write code. 2.) Falls asleep at the keyboard while writing code for more than an hour or two. 3.) Has a lot of passion for sticking with an idea and making it happen, even if that means writing code and knowing the code is crap.

So, in short, LLMs write better code than I do. I'm not alone.

djaouen 3 hours ago | parent [-]

You are defective. But, fear not! So is the rest of humanity!

CapsAdmin 3 hours ago | parent | prev | next [-]

I've been playing with vibe coding a lot lately and I think in most cases, the current SOTA LLMs don't produce code that I'd be satisfied with. I kind of feel like LLMs are really, really good at hacking on a messy and fragile structure, because they can "keep track of many things in their head".

BUT

An LLM can write a PNG decoder that works in whatever language I choose in one or a few shots. I can do that too, but it will take me longer than a minute!

(and I might learn something about the png format that might be useful later..)
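
(For a sense of scale, here's a minimal sketch in Python of just the chunk-walking part of such a decoder; the file name is hypothetical, and a real decoder also needs zlib inflation and scanline unfiltering:)

  import struct

  def png_chunks(path):
      # PNG layout: 8-byte signature, then chunks of
      # length (4) + type (4) + data + CRC (4).
      with open(path, "rb") as f:
          assert f.read(8) == b"\x89PNG\r\n\x1a\n"
          while True:
              header = f.read(8)
              if len(header) < 8:
                  break
              length, ctype = struct.unpack(">I4s", header)
              data = f.read(length)
              f.read(4)  # skip CRC
              yield ctype, data
              if ctype == b"IEND":
                  break

  for ctype, data in png_chunks("example.png"):  # hypothetical file
      if ctype == b"IHDR":
          width, height, depth, color = struct.unpack(">IIBB", data[:10])
          print(width, height, depth, color)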

Also, we engineers can talk about code quality all day, but does this really matter to non-engineers? Maybe objectively it does, but can we convince them that it does?

blibble 2 hours ago | parent | next [-]

> Maybe objectively it does, but can we convince them that it does?

How long would you give our current civilisation if the quality of software ceased to be important for:

  - medical devices
  - aircraft
  - railway signalling systems
  - engine management systems
  - the financial system
  - electrical grid
  - water treatment
  - and every other critical system
unless "AI" dies, we're going to find out
zahlman 2 hours ago | parent | prev | next [-]

I kinda like Theo's take on it (https://www.youtube.com/watch?v=Z9UxjmNF7b0): there's a sliding scale of how much slop should reasonably be considered acceptable, and engineers are well advised to think about it more seriously. I'm less sold on the potential benefits (since some of the examples he's given are things that I would also find easy by hand), but I agree with the general principle: having the option to do things in a super-sloppy way, combined with spending time developing intuition around that option (and what could be accomplished that way), can produce positive feedback loops.

In short: when you produce the PNG decoder, and are satisfied with it, it's because you don't have a good reason to care about the code quality.

> Maybe objectively it does, but can we convince them that it does?

I strongly doubt it, and that's why articles like TFA project quite a bit of concern for the future. If non-engineers end up accepting results from a low-quality, not-quite-correct system, that's on them. If those results compromise credentials, corrupt databases etc., not so much.

raddan 3 hours ago | parent | prev [-]

I tried vibe coding a BMP decoder not too long ago with the rationale being “what’s simpler than BMP?”

What I got was an absolute mess that did not work at all. Perhaps this was because, in retrospect, BMP is not actually all that simple, a fact that I discovered when I did write a BMP decoder by hand. But I spent equal time vibe coding and real coding. At the end of the real coding session, I understood BMP, which I see as a benefit unto itself. This is perhaps a bit cynical but my hot take on vibe coders is that they place little value on understanding things.

gbnwl 3 hours ago | parent | next [-]

Mind explaining the process you tried? As someone who's generally not had any issue getting LLMs to sort out my side projects (ofc with my active involvement as well), I really wonder what people who report these results are trying. Did you just open a chat with Claude Code and try to get a single context window to one-shot it?

zahlman 2 hours ago | parent | prev [-]

Just out of curiosity (as someone fairly familiar with the BMP spec, and also PNG incidentally): what did you find to be the trickiest/most complex aspects?

raddan 2 hours ago | parent [-]

None of this is fresh in my mind, so my recollections might be a little hazy. I think the only issue I personally had when writing a decoder was keeping the alignment of various fields right. I wrote the decoder in C# and if I remember correctly I tried to get fancy with some modern-ish deserialization code. I think I eventually resorted to writing a rather ugly but simple low-level byte reader. Nevertheless I found it to be a relatively straightforward program to write and I got most of what I wanted done in under a day.

The vibe coded version was a different story. For simplicity, I wanted to stick to an early version of BMP. I don’t remember the version off the top of my head. This was a simplified implementation for students to use and modify in a class setting. Sticking to early version BMPs also made it harder for students to go off-piste since random BMPs found on the internet probably would not work.

The main problem was that the LLM struggled to stick to a specific version of BMP. Some of those newer features (compression, color table, etc, if I recall correctly) have to be used in a coordinated way. The LLM made a real mess here, mixing and matching newer features with older ones. But I did not understand that this was the problem until I gave up and started writing things myself.
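
To illustrate the alignment point (a sketch in Python rather than the C# I used, with a made-up file name): the two BMP headers are packed structs, and letting the language insert its natural padding after the 2-byte "BM" magic shifts every later field.

  import struct

  def read_bmp_header(path):
      with open(path, "rb") as f:
          # "<" = packed little-endian; native alignment would pad
          # after the 2-byte magic and throw off the later offsets.
          magic, file_size, _r1, _r2, pixel_offset = struct.unpack("<2sIHHI", f.read(14))
          assert magic == b"BM"
          info = struct.unpack("<IiiHHIIiiII", f.read(40))  # BITMAPINFOHEADER
          hdr_size, width, height, planes, bpp, compression = info[:6]
          return width, height, bpp, compression, pixel_offset

  print(read_bmp_header("example.bmp"))  # hypothetical test file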

paodealho 3 hours ago | parent | prev | next [-]

I worked for a relatively large company (around 400 of its employees are programmers). The people who embraced LLM-generated code clearly share one trait: they are feature pushers who love to say "yes!" to management. You see, management is always right, and these programmers are always so eager to put their requirements, however incomplete, into a Copilot session and open a pull request as fast as possible.

The worst case I remember happened a few months ago, when a staff (!) engineer gave a presentation about benchmarks they had done comparing Java and Kotlin concurrency tools and how to write concurrent code. There was a very large and strange difference in performance favoring Kotlin that didn't make sense. When I dug into their code, it was clear everything had been generated by an LLM (lots of comments with emojis, for example) and the Java code was just wrong.

The competent programmers I've seen there use LLMs to generate shell scripts, small Python automations, or to explore ideas. Most of the time they are unimpressed by these tools.

rytill 2 hours ago | parent | prev | next [-]

LLMs are not "average text generation machines" once they have context. LLMs learn a distribution.

The moment you start the prompt with "You are an interactive CLI tool that helps users with software engineering at the level of a veteran expert" you have biased the LLM such that the tokens it produces are from a very non-average part of the distribution it's modeling.

ch4s3 3 hours ago | parent | prev | next [-]

In my view they’re great for rough drafts, iterating on ideas, throwaway code, and pushing into areas I haven’t become proficient in yet. I think in a lot of cases they write ok enough tests.

nitwit005 3 hours ago | parent | prev | next [-]

It'd be rather surprising if you could train an AI on a bunch of average code and somehow get code that's always above average. Where would the improvement come from?

We should feed the output code back in to get even better code.

zahlman 3 hours ago | parent [-]

AI can generally improve through reinforcement learning, but that requires being able to compare its output against some form of metric. There aren't a lot of people I'd trust to do RLHF for code quality, and anything more automated than that is destined to collapse due to Goodhart's Law.

throwawayffffas 3 hours ago | parent | prev | next [-]

> Hearing people on tech twitter say that LLMs always produce better code than they do by hand was pretty enlightening for me.

That's hilarious. LLM code is always very bad. Its only merit is that it occasionally works.

> LLMs can produce better code for languages and domains I’m not proficient in.

I am sure that's not true.

caycep 2 hours ago | parent | next [-]

I think it says more about who's still on tech twitter than anything about the LLM...

ambicapter 2 hours ago | parent | prev [-]

It seems true by construction. If you're not proficient in a language then the bar for "better than you" is necessarily lower.

abighamb 3 hours ago | parent | prev | next [-]

This has been my experience as well.

It's let me apply my general knowledge across domains, and do things in tech stacks or languages I don't know well. But that has also cost me hours debugging a solution I don't quite understand.

When working in my core stack though it's a nice force multiplier for routine changes.

logicallee 3 hours ago | parent [-]

>When working in my core stack though it's a nice force multiplier for routine changes.

what's your core stack?

jimbo1167 3 hours ago | parent | prev | next [-]

How are you judging that you'd write "better" code? More maintainable? More efficient? Does it produce bugs in the underlying code it's generating? Genuinely curious where you see the current gaps.

jordwest 3 hours ago | parent [-]

For me the biggest gaps in LLM code are:

- it adds superfluous logic that is assumed but isn’t necessary

- as a result the code is more complex, verbose, harder to follow

- it doesn’t quite match the domain because it makes a bunch of assumptions that aren’t true in this particular domain

They’re things that can often be missed on a first-pass look at the code but end up adding a lot of accidental complexity that bites you later.

When reading an unfamiliar code base we tend to assume that a certain bit of logic is there for a good reason, and that helps you understand what the system is trying to do. With LLM-generated codebases we can't really assume that anymore, unless the code has been thoroughly audited/reviewed/rewritten, at which point I find it's easier to just write the code myself.

CapsAdmin 3 hours ago | parent | next [-]

This has been my experience as well. But, these are things we developers care about.

Coding aside, LLMs aren't very good at following nice practices in general unless explicitly prompted to. For example, if you ask an LLM to create an error modal box from scratch, will it also implement the ability to select the text, or to Ctrl-C to copy the text, or perhaps add a copy-message button? Maybe this is a bad example, but they usually don't do things like this unless you explicitly ask. I don't personally care too much about this, but I think it's noteworthy in the context of lay people using LLMs to vibe code.

zahlman 2 hours ago | parent | prev [-]

I've seen a lot of examples where it fails to take advantage of previous work and rewrites functionality from scratch.

moron4hire an hour ago | parent | prev | next [-]

If firing up old coal plants and skyrocketing RAM prices and $5000 consumer GPUs and violating millions of developers' copyrights and occasionally coaxing someone into killing themselves is the cost of Brian Who Got Peter Principled Into Middle Management getting to enjoy programming again instead of blaming his kids for why he watches football and drinks all weekend instead of cultivating a hobby, I guess we have no choice but to oblige him his little treat.

bdangubic 3 hours ago | parent | prev | next [-]

> if you’re seeing output that is consistently better than what you produce by hand, you’re probably just below average at programming

Even though this statement does not quite make sense mathematically / statistically, the vast majority of SWEs are "below average," and therein lies the crux of this debate. I've been coding since the 90's and:

- LLM output is better than mine from the 90’s

- LLM output is better than mine from early 2000’s

- LLM output is worse than any of mine from 2010 onward

- LLM output (in the right hands) is better than 90% of human-written code I have seen (and I’ve seen a lot)

iwontberude 3 hours ago | parent | prev | next [-]

The most prolific coders are also more competent than average. Their prolific output is what has trained these models. These models are trained on incredibly successful projects written by masters of their fields. This is usually where I find the most pushback: the most competent SWEs see it as theft, and also as useless to them, since they have already spent years honing skills to work relentlessly and efficiently towards solutions -- sometimes at great expense.

nitwit005 3 hours ago | parent | next [-]

I'd assume most of the code visible on the web leans amateur. A huge portion of GitHub repos seems to be from students these days. You'll see GitHub's Education page listing "5 million students" (https://github.com/education), which I assume is an underestimate, as that's only the formal program.

habinero 3 hours ago | parent | prev [-]

> The most prolific coders are also more competent than average

This is absolutely not true lol, as anyone who's worked with a fabled 10X engineer will tell you. It's like saying the best civil engineer is the one who builds the most bridges.

The best code looks real boring.

iwontberude 3 hours ago | parent [-]

I've worked with a 10x engineer and indeed they were significantly more competent than the rest of the team in their execution and design. They've seen so many problems and had a chance to discard bad patterns and adopt/try out new ones.

habinero 3 hours ago | parent | prev [-]

I saw someone refer to it as future digital asbestos.