Remix.run Logo
sfink 8 hours ago

Wow.

I am pretty insensitive to AI writing. I have never commented before about something sounding like AI, because mostly I don't notice. But this was so over the top that I spent the whole article trying to decide whether it was an intentional parody of AI writing style.

This article's language is not en-US. It's not en-BR. It's en-SLOP.

Yes, that was my clumsy attempt at AI parody. Here's another: this article doesn't just have AI tells. It is AI tells.

Every sentence is saturated with AI style. Perhaps the author so AI-indoctrinated that they can't see this? It doesn't read as even vaguely plausible human writing. Which is mightily ironic given the thesis of "AI generated stuff is just fine, m'kay?" The writing style does more to defeat its conclusion than the analysis itself.

As for the substance of the analysis, it seems pretty good to me but I see some flaws that weaken it a bit.

The presence of "The Outlier Nobody Noticed" proves nothing and deserves no more than a passing mention. A random release introduced way more bugs than the Claude-containing releases. That provides evidence that Claude doesn't introduce more bugs only if your hypothesis is a very naive "AI is the only thing that can ever increase bug introduction rates."

The whole analysis has very limited data. It's necessarily based off a single pair of releases at the very end of the chronological timeline. You would never be able to reject a null hypothesis based only on that, so it's even less sound to present it as proving the null hypothesis. (By the same token, it would be incorrect for critics to claim that it proves their point. Did anyone claim this, though? The heated complaints seemed more based on priors about AI code.)

"The critics' claim is a simple comparison: did the rate go up?" That's reductive. For one, these releases are known to be in reaction to a flood of (AI-discovered!) security reports, which is a novel situation and in fact is a huge confound to anyone arguing about what those two releases mean -- they're both heavily AI-written, but in response to an unusual situation. When the samples are only drawn from a distinct scenario, statistic analysis can only speak to the quality of code in that scenario.

Also, another reasonable hypothesis could be: AI-written code has bugs of a different flavor that bothers users more. It's optimized for passing tests and convincing people and AIs that security holes are closed, which means other considerations like preserving functionality can more easily be regressed as compared to if humans were doing it. (If true, it still doesn't support the claim that depending on AI code is a catastrophe, fwiw.)

I'm not arguing the conclusion is wrong. I'm saying the analysis proves far less than it claims to. As for whether it's a debacle for rsync to become dependent on AI code generation, I think that's a reasonable debate to have but it's not going to be resolved this reductively.

logicprog 6 hours ago | parent [-]

> The presence of "The Outlier Nobody Noticed" proves nothing and deserves no more than a passing mention. A random release introduced way more bugs than the Claude-containing releases. That provides evidence that Claude doesn't introduce more bugs only if your hypothesis is a very naive "AI is the only thing that can ever increase bug introduction rates."

It does not statistically prove anything, but as I thought I made extremely clear in the card where I discuss it, the point of bringing it up is different: to prove the hypocrisy of the anti-AI crowd.

> By the same token, it would be incorrect for critics to claim that it proves their point. Did anyone claim this, though? The heated complaints seemed more based on priors about AI code.

The entire outrage is because people noticed what they thought was an unusual number of bugs and/or regressions in the release, saw it had Claude in it, and assumed a causal link, not just "priors about AI code."

> You would never be able to reject a null hypothesis based only on that, so it's even less sound to present it as proving the null hypothesis.

The point I'm trying to make is that there is no evidence, based on these two releases, to think Claude made anything worse, whatsoever, and so the outrage is unfounded. This doesn't require me to prove Claude didn't cause any problems. If I ever made the latter claim, I should clean that up.

> It's optimized for passing tests and convincing people and AIs that security holes are closed, which means other considerations like preserving functionality can more easily be regressed as compared to if humans were doing it.

Tridge actually explicitly says he made that tradeoff on purpose, not the AI.

> Every sentence is saturated with AI style. Perhaps the author so AI-indoctrinated that they can't see this? It doesn't read as even vaguely plausible human writing. Which is mightily ironic given the thesis of "AI generated stuff is just fine, m'kay?" The writing style does more to defeat its conclusion than the analysis itself.

I've since rewritten nearly 100% of the prose in the analysis with my own, more inflammatory and verbose style. I also intentionally left in my natural mispellings and typos, to prove it was me.

sfink 5 hours ago | parent [-]

My post wasn't written in a way to make friends, but:

> I've since rewritten nearly 100% of the prose in the analysis with my own, more inflammatory and verbose style. I also intentionally left in my natural mispellings and typos, to prove it was me.

Thank you thank you thank you. I would love to be able to describe how hard it was for me to think about the actual evidence you're presenting when reading about it through the AI writing, but I suspect it's one of those things where it bothers you or it doesn't. If you'd like to empathize, maybe I'll give it one try: imagine an otherwise solid PhD thesis written in crayon. The facts and evidence and reasoning are unaffected, but it's just so hard to take it seriously.

Anyway, with the rewrite I don't have to battle my kneejerk reactivity nearly as much.

I'm no expert like she is, but based on what I know, I agree with your wife on the statistics. That style of analysis is going to be the best you can do with the data available. It's an accepted way to stretch data without being too dependent on an assumed distribution. It's a good analysis. I still don't come away with the conclusion that concerns about AI code maintenance are necessarily overblown, but that's fine. I think your analysis project is a very solid contribution, and it's a hell of a lot more evidence-based than the rants people were posting.