Preliminary data from a longitudinal AI impact study(newsletter.getdx.com)
36 points by donutshop 6 hours ago | 27 comments
SirensOfTitan 3 hours ago | parent | next [-]

This reads as incredibly damning to me. PR throughput should be a metric that is very supportive of the AI productivity narrative, but the effect is marginal.

Before everyone gets at me: smoking cigarettes increases your risk of lung cancer by 15-30x. Effect size matters. As does margin of error: what is the margin of error? This "increase" could easily be within noise.

PR throughput is also not a metric I would ever use to determine developer productivity for a paradigm shifting technology. I would only ever use it to compare like-to-like to find trailheads: is a team or person suddenly way more or less productive? The primary endpoint for software production is serving your customer or your mission, and PR throughput can't tell you whether any of that got better. It also cannot tell you the cost of your prior work: the increase in PR throughput could be more PRs to fix issues introduced by LLM-assisted work.

lumost an hour ago | parent [-]

I suspect the issue is the SDLC methodology of existing mature products. The "I can build it in a weekend" use case has gotten a massive boost, since you can build something that "looks" real faster than ever. Mature teams need to deal with backwards compatibility and real development risk.

jwilliams 2 hours ago | parent | prev | next [-]

I wrote a short piece on a similar topic the other day[^a]. Just because something is faster, or even measurably better, doesn't mean it translates into end productivity.

1. You might be speeding up something that is inherently not productive (the "faster horses" trope). I see companies using AI to generate performance reviews, and the same companies using AI to summarize all the new performance material they're now getting. All that's happening is amplified busywork (there is real work in there, but it's questionable whether it's improved).

2. Some things are zero sum. If you're not using AI for marketing you might fall behind. So you adopt these tools, but attention/etc are limited. There is no net gain, just competition.

3. You might speed one part up (typing code), but then other parts of your pipeline quickly become constraints. It might be a long time before we're able to adapt the end-to-end process. This is amplified by coding tools being three strides ahead.

4. Then there are actual productivity improvements. One of these PRs could have been "translate this to German". That could be one PR but a whole step-change for the business.

So much of what is happening falls in buckets 1+2+3. I don't think we've really got into the meat of 4 yet.

a: https://jonathannen.com/ai-productivity/

0xbadc0de5 4 hours ago | parent | prev | next [-]

Fair assessment. And worth noting that in a sane world, a broad 10% productivity improvement across industry would be a once-in-a-lifetime, headline-making story, not a disappointment.

Swizec 2 hours ago | parent | next [-]

> And worth noting that in a sane world, a broad 10% productivity improvement across industry would be a once-in-a-lifetime, headline-making story, not a disappointment

The biggest risk in software development is building the wrong thing. Digging yourself into a hole 10% faster is _worse_. You now have more backtracking to do!

nubg 2 hours ago | parent | prev [-]

Agreed, but if that came at a cost of 1 trillion dollars of debt and investments, it might be a disappointment again.

Note that I am bullish on AI coding in general, just trying to contextualize your statement.

rybosworld 4 hours ago | parent | prev | next [-]

> Planning, alignment, scoping, code review, and handoffs—the human parts of the SDLC—remain largely untouched

Seems likely that process is holding things back. Planning has always been a "best-guess". There's lots you can't account for until you start a task.

Code review mostly exists because the cost of doing something wrong was high (because human coding is slow). If you can code faster, you can replace bad code faster. In other words, LLMs have lowered the cost of deployment.

We can't honestly assess the new way of doing things when we bring along the baggage of the old way of doing things.

felipeerias 3 hours ago | parent | next [-]

Planning might end up being more reliable thanks to coding agents: if you want to estimate how long a task would take, just send an agent to do it.

If the agent comes back in a few minutes with a tiny fix, it is probably a small task.

If the agent produces a large, convoluted solution that would need careful review, it is at least a medium task.

And if the agent gets stuck, runs into architectural constraints, etc. then it is definitely a hard task.
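The triage above could be sketched as a simple heuristic. This is purely illustrative: `AgentResult`, its fields, and the thresholds are all invented here, not part of any real agent API.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    succeeded: bool        # did the agent produce a working change?
    lines_changed: int     # size of the proposed diff
    minutes_spent: float   # wall-clock time the agent ran

def estimate_task_size(result: AgentResult) -> str:
    """Map an exploratory agent run to a rough T-shirt size."""
    if not result.succeeded:
        # Agent got stuck or hit architectural constraints: a hard task.
        return "large"
    if result.minutes_spent < 10 and result.lines_changed < 50:
        # Quick, tiny fix: probably a small task.
        return "small"
    # A big or convoluted diff needing careful review: at least medium.
    return "medium"
```

The interesting design question is whether a failed agent run is a reliable "hard" signal, or just a sign the prompt or codebase context was poor; in practice you would probably want several runs before trusting the estimate.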

nine_k 41 minutes ago | parent | prev [-]

The cost of doing something wrong still is high. Even if bad code is produced instantaneously, its detrimental effect on production remains the same. Yes, yes, what fell on the floor and was picked up in five seconds is still considered fine to eat! Does not apply to eggs though. Customer trust is usually such an egg.

Writing code has become much faster. Writing correct and reliable code has become somewhat faster, but not nearly as much. Understanding what code to write has barely become faster.

The more novel the code you're writing, the smaller the gains from having AI write it.

naasking 3 hours ago | parent | prev | next [-]

Sounds reasonable, but gains will go up. There is a ceiling somewhere, but we don't know where it is.

Insanity 3 hours ago | parent [-]

Yup, and the ceiling could be at 11% or at 50%. But my bet is closer to the lower end of that range than the upper. Models are no longer revolutionary, they are evolutionary, and the per-version difference is narrowing with each release.

naasking 2 hours ago | parent [-]

> Models are no longer revolutionary, they are evolutionary, and the per-version difference is narrowing with each release.

We've definitely culled some low hanging fruit, but I think there's still a lot of room for improvements that could lead to step changes in capabilities. I think we're only scratching the surface of looped language models, thinking in latent space, and multimodality.

And even if the per-model differences are narrowing, single-digit improvements in performance metrics could still yield outsized effects in applicability and productivity. Consider services that guarantee one 9 of reliability vs. five 9s. In absolute terms that change looks like a trivial difference, but the increased reliability allows use in way, way more domains.
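As rough arithmetic for the nines analogy (a back-of-the-envelope sketch; the function name is mine):

```python
# "One 9" = 90% uptime, "five 9s" = 99.999% uptime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime implied by an uptime of `nines` nines."""
    uptime = 1 - 10 ** -nines   # 1 nine -> 0.9, 5 nines -> 0.99999
    return MINUTES_PER_YEAR * (1 - uptime)

# One 9:   ~52,560 minutes of downtime per year (~36.5 days).
# Five 9s: ~5.3 minutes per year.
```

The uptime figures differ by under 10 percentage points, yet the implied downtime differs by a factor of ~10,000, which is the point about small metric gains unlocking whole new domains.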

deterministic an hour ago | parent | prev | next [-]

If you think a department or individual working 10% faster makes a company 10% more productive, you’re almost certainly wrong.

Productivity only improves if the change increases revenue or reduces costs. And that rarely happens unless you improve the actual bottleneck of the organization.

To understand why, I recommend the book The Goal: A Process of Ongoing Improvement by Eliyahu M. Goldratt and Jeff Cox.

nemo44x 2 hours ago | parent | prev | next [-]

At the very least, teams will communicate with each other much better. So much of the tedium of office work can be automated, so people can spend more time solving problems instead.

But the communication will massively improve. More artifacts documenting progress and needs will be generated, and AI can link related things across an organization rapidly and accurately. Workflows will massively improve. A living graph of an entire organization will come to life.

I think more productivity gains will come from this automation than anything. People will look back at all the drudgery workers did.

enraged_camel 4 hours ago | parent | prev | next [-]

>> November 2024 through February 2026

Yeah, listen... I'm glad these types of studies are being conducted. I'll say this though: the difference between pre- and post-Opus 4.5 has been night and day for me.

From August 2025 through November 2025 I led a complex project at work where I used Sonnet 4.5 heavily. It was very helpful, but my total productivity gains were around 10-15%, which is pretty much what the study found. Once Opus came out in November though, it was like someone flipped a switch. It was much more capable at autonomous work and required way less hand-holding, intervention or course-correction. 4.6 has been even better.

So I'm much more interested in reading studies like this over the next two years where the start period coincides with Opus 4.5's release.

jackschultz 3 hours ago | parent | next [-]

Very much agree. I gave a presentation on AI to a group earlier this week and spent a third of the time talking about the Opus 4.5 inflection point in AI history. The first time I used that model, the day it was released, it was clear it knew what it was doing at a different level. People still jump around between different models, tools, or time frames when talking about AI's usefulness, but those comparisons have no meaning if they're not using the Opus 4.5 and 4.6 models with Anthropic harnesses like Claude Code or Cowork.

I'm interested in whether these studies, and the history of AI generally, will come to recognize that as the point when things changed, because for us devs, that was the moment.

nubg 2 hours ago | parent [-]

Would you mind sharing the presentation? Or an AI summary of it.

esseph 3 hours ago | parent | prev | next [-]

I swear people say this with every single model and release version, without fail.

slopinthebag 3 hours ago | parent | prev [-]

> It was very helpful, but my total productivity gains were around 10-15%, which is pretty much what the study found. Once Opus came out in November though, it was like someone flipped a switch. It was much more capable at autonomous work and required way less hand-holding, intervention or course-correction. 4.6 has been even better.

How did you track these gains?

jongjong 3 hours ago | parent | prev | next [-]

As I've said before, AI is a force multiplier. A 10x developer is now a 100x developer and a -10x developer (complexity maker/value destroyer) is now a -100x developer.

I can understand why a lot of companies are cutting junior roles. What AI does is it automates most of the stuff that juniors are good at (coding fast) but not much of the stuff that the seniors are good at.

That said, I've worked with some juniors who managed to navigate this; they do it by focusing on higher-order thinking and developing a sense of what's important by interacting with senior engineers. Unfortunately, it raises the talent bar for juniors: they have to become more intelligent, not in a puzzle-solving way, but in an architectural, big-picture way; almost like entrepreneurial thinking but more detailed/complex.

LLMs don't have a worldview, which means they miss a lot of inconsistencies and logical contradictions. Most critically, LLMs don't know what's important (at least not accurately enough), so they can't prioritize effectively and they make a lot of bad decisions.

It's kind of interesting for me because a lot of the areas where I had a contrarian opinion in the field of software development, I now see LLMs getting trapped into those and getting bad results. It's like all my contrarian opinions became much more valuable.

arisAlexis 5 hours ago | parent | prev | next [-]

because the human may be the bottleneck soon

eucyclos 3 hours ago | parent [-]

It might be more accurate to say humans will only work at the bottlenecks soon, unless I've misunderstood the vector of your commentary.

SiempreViernes 3 hours ago | parent [-]

A lot of AI-boosting commentary does speak in terms where there are hardly any humans left in their dream world, so it makes sense to ask if that's what they mean!

verdverm 6 hours ago | parent | prev [-]

so far, we're still learning how to use this new tool, which is also getting better with each release

dude250711 5 hours ago | parent [-]

I agree, it was about 10.29% earlier this year, now we are standing at least at 10.35% or something.

verdverm 5 hours ago | parent [-]

The last one that made the rounds was negative, so we have moved more than 10 points in less than half a year.

eucyclos 3 hours ago | parent [-]

That's got to be more about processes around it than the tool itself though, right?