bandrami 4 hours ago

But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".

59nadir 4 hours ago | parent [-]

That's a fair observation and one that I don't really have an answer for. I can say from personal experience that I believe that shipping nonsense code has never been faster. That's just an anecdote, obviously.

We need a bigger version of the METR study on perceived vs. real productivity[0], I guess. It's a thankless job, though, since people will assume/state even at publication time that "Everything has progressed so much, those models and agents sucked, everything is 10 times better now!" and you basically have to start a new study, repeat ad infinitum.

One problem that really complicates things is that the net competency of these models seems really spotty and uneven. They're apparently out here solving math problems that seemingly "require thinking", but at the same time will write OpenGL code that produces black screens on basically every driver, doesn't produce the intended results, and costs hours of debugging time for anyone not familiar enough with OpenGL. That's despite OpenGL code presumably being far more prevalent out there than math proofs. How do you even theorize reliably about things like this when something can be so bad and (apparently) so good at the same time?

0 - https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...