Remix.run Logo
stitched2gethr 2 days ago

This contains some specific data with pretty graphs: https://youtu.be/tbDDYKRFjhk?t=623

But if you do professional development and use something like Claude Code (the current standard, IMO) you'll quickly get a handle on what it's good at and what it isn't. I think it took me about 3-4 weeks of working with it at an overall 0x gain to realize what it's going to help me with and what it will make take longer.

MontyCarloHall 2 days ago | parent [-]

This is a great conference talk, thanks for sharing!

To summarize, the authors enlisted a panel of expert developers to review the quality of various pull requests, in terms of architecture, readability, maintainability, etc. (see 8:27 in the video for a partial list of criteria), and then somehow aggregate these criteria into an overall "productivity score." They then trained a model on the judgments of the expert developers, and found that their model had a high correlation with the experts' judgment. Finally, they applied this model to PRs across thousands of codebases, with knowledge of whether the PR was AI-assisted or not.

They found a 35-40% productivity gain for easy/greenfield tasks, 10-15% for hard/greenfield tasks, 15-20% for easy/brownfield tasks, and 0-10% for hard/brownfield tasks. Most productivity gains went towards "reworked" code, i.e. refactoring of recent code.

All in all, this is a great attempt at rigorously quantifying AI impact. However, I do take one major issue with it. Let's assume that their "productivity score" does indeed capture the overall quality of a PR (big assumption). I'm not sure this measures the overall net positive/negative impact to the codebase. Just because a PR is well-written according to a panel of expert engineers doesn't mean it's valuable to the project as a whole. Plenty of well-written code is utterly superfluous (trivial object setters/getters come to mind). Conversely, code that might appear poorly written to an outsider expert engineer could be essential to the project (the highly optimized C/assembly code of ffmpeg comes to mind, or to use an extreme example, anything from Arthur Whitney). "Reworking" that code to be "better written" would be hugely detrimental, even though the judgment of an outside observer (and an AI trained on it) might conclude that said code is terrible.