MontyCarloHall 2 days ago

Forget a livestream; I want to hear from maintainers of complex, actively developed, and widely used open-source projects (e.g. ffmpeg, curl, openssh, sqlite). Highly capable coding LLMs have been out long enough that if they really do have a meaningful impact on writing non-trivial, non-greenfield/boilerplate code, it ought to be clearly apparent in an uptick of positive contributions to projects like these.

stitched2gethr 2 days ago | parent | next

This contains some specific data with pretty graphs: https://youtu.be/tbDDYKRFjhk?t=623

But if you do professional development and use something like Claude Code (the current standard, IMO), you'll quickly get a handle on what it's good at and what it isn't. It took me about 3-4 weeks of working with it, at roughly zero net gain, to figure out where it would help me and where it would actually slow me down.

MontyCarloHall 2 days ago | parent

This is a great conference talk, thanks for sharing!

To summarize: the authors enlisted a panel of expert developers to review the quality of various pull requests in terms of architecture, readability, maintainability, etc. (see 8:27 in the video for a partial list of criteria), and then somehow aggregated these criteria into an overall "productivity score." They then trained a model on the judgments of the expert developers and found that its scores correlated highly with the experts' judgments. Finally, they applied this model to PRs across thousands of codebases, with knowledge of whether each PR was AI-assisted or not.
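
If I understand the setup correctly, the core of it looks something like this (a minimal sketch of my reading, not their actual pipeline; the criteria weights and toy numbers are all hypothetical):

    from statistics import correlation  # Python 3.10+

    # Hypothetical weights; the talk lists more criteria (see 8:27).
    WEIGHTS = {"architecture": 0.4, "readability": 0.3, "maintainability": 0.3}

    def productivity_score(ratings: dict[str, float]) -> float:
        """Aggregate per-criterion expert ratings (0-10) into one score."""
        return sum(w * ratings[c] for c, w in WEIGHTS.items())

    # Toy stand-ins for the expert panel's ratings of three PRs...
    expert_scores = [productivity_score(r) for r in [
        {"architecture": 8, "readability": 7, "maintainability": 9},
        {"architecture": 4, "readability": 6, "maintainability": 5},
        {"architecture": 9, "readability": 9, "maintainability": 8},
    ]]
    # ...and for what a model trained on those labels predicts for the same PRs.
    model_scores = [7.9, 5.1, 8.7]

    # The validation step: does the model track the experts?
    print("Pearson r:", correlation(expert_scores, model_scores))

The whole result hinges on that aggregation step actually meaning something, which is where my objection below comes in.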

They found a 35-40% productivity gain for easy/greenfield tasks, 10-15% for hard/greenfield tasks, 15-20% for easy/brownfield tasks, and 0-10% for hard/brownfield tasks. Most productivity gains went towards "reworked" code, i.e. refactoring of recent code.

All in all, this is a great attempt at rigorously quantifying AI impact. However, I do take one major issue with it. Let's assume that their "productivity score" does indeed capture the overall quality of a PR (a big assumption). I'm still not sure it measures the net positive/negative impact on the codebase.

Just because a PR is well-written according to a panel of expert engineers doesn't mean it's valuable to the project as a whole. Plenty of well-written code is utterly superfluous (trivial object getters/setters come to mind). Conversely, code that might appear poorly written to an outside expert engineer can be essential to the project (the highly optimized C/assembly of ffmpeg comes to mind, or, to use an extreme example, anything by Arthur Whitney). "Reworking" that code to be "better written" would be hugely detrimental, even though an outside observer (and an AI trained on its judgments) might conclude that said code is terrible.
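
To make that concrete, here's a contrived illustration (hypothetical code, not from ffmpeg or anywhere else); pretend the second version lives in a hot path where the branchless form was chosen deliberately:

    def next_pow2_readable(n: int) -> int:
        """Round n up to the next power of two (n >= 1)."""
        p = 1
        while p < n:
            p *= 2
        return p

    def next_pow2_terse(n: int) -> int:
        """Same result for 1 <= n < 2**32, via branchless bit-twiddling."""
        n -= 1
        n |= n >> 1; n |= n >> 2; n |= n >> 4; n |= n >> 8; n |= n >> 16
        return n + 1

A reviewer scoring on readability would likely rate a PR that "reworks" the second into the first as an improvement, and in the kind of codebase I'm talking about it would be a regression.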

rhubarbtree a day ago | parent | prev | next

Yes, this would be really useful.

AI coding should be transforming OSS, and we should be able to get a rough idea of the scale of the speed-up in development. It's an ideal application area.

brookst 2 days ago | parent | prev

So what percentage of human programmers, in the entire world, do you think contribute to meaningful projects like those?

MontyCarloHall 2 days ago | parent

I picked these specific projects because they are a) mature, b) complex, and as a result c) unlikely to need lots of new boilerplate code.

I would estimate that the majority of developers spend most of their time on codebases with all three of these properties, even if their software is not as meaningful or widely used as the examples above. Everyone knows that LLMs are fantastic at generating greenfield boilerplate very quickly. They are an invaluable rapid prototyping/MVP tool, and that in itself is hugely useful.

But that's not where developers spend most of their time. They spend it maintaining complicated, mature codebases, and the utility of LLMs is much less proven for that use case. This utility would be most easily measured in contributions to open-source projects, since all commits are public and maintainers have no monetary incentive to misrepresent the impact of AI [0, 1, 2, ...].
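
One crude way to start measuring it from public data: count commits that disclose AI assistance via commit trailers. A rough sketch (the "Co-authored-by: Claude" convention and the repo path are my assumptions, and since plenty of AI use goes undisclosed, any count is only a lower bound):

    import subprocess

    def count_commits(repo: str, grep: str | None = None) -> int:
        """Count commits in repo, optionally only those whose messages match grep."""
        cmd = ["git", "-C", repo, "rev-list", "--count", "HEAD"]
        if grep:
            cmd += [f"--grep={grep}", "-i"]  # -i: case-insensitive match
        return int(subprocess.check_output(cmd, text=True).strip())

    repo = "/path/to/ffmpeg"  # any local clone of a mature project
    total = count_commits(repo)
    ai = count_commits(repo, grep="co-authored-by: claude")
    print(f"{ai}/{total} commits disclose Claude assistance")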

[0] https://www.businessinsider.com/anthropic-ceo-ai-90-percent-...

[1] https://www.cnbc.com/2025/06/26/ai-salesforce-benioff.html

[2] https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...