dbingham a day ago

It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase.

If that's the case, then I have a bad feeling for the state of our industry. My experience with LLMs is that their code does _not_ cut it. The hallucinations are still a serious issue, and even when they aren't hallucinating they do not generate quality code. Their code is riddled with bugs, bad architectures, and poor decisions.

Writing good code with an LLM isn't any faster than writing good code without it, since the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. You have to spend more or less the same amount of time with the LLM understanding the code, thinking about the problems, and verifying its work (and then reprompting or redoing its work) as you would just writing it yourself from the beginning (most of the time).

Which means that all these companies that are firing workers and demanding their remaining employees use LLMs to increase their productivity and throughput are going to find themselves in a few years with spaghettified, bug-riddled codebases that no one understands. And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.

Of course, there could be an unforeseen new order of magnitude jump. There's always the chance of that and then my prediction would be invalid. But so far, what I see is a fast approaching plateau.

noduerme a day ago | parent | next [-]

Wouldn't that be the best thing possible for our industry? Watching the bandwagoners and "vibe coders" get destroyed and come begging for actual thinking talent would be delicious. I think the bets are equal on whether later LLMs can unfuck current LLM code to the degree that no one needs to be re-hired... but my bet is on your side: that bad code collapses under its own weight, as does bad management in thrall to trends whose repercussions they don't understand.

The scenario you're describing is almost too good. It would be a renaissance for the kind of thinking coders you're talking about, those of us who spend 90% of our time considering how to fit a solution to a domain and a specific problem, and it would scare the hell out of the next crop of corner-suite assholes, essentially enshrining the belief that only smart humans can write code that performs on the threat/performance model needed to deal with any given problem.

>> the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking.

Unfortunately, this is now a minority understanding of how we need to do our job, both among hires and the people who hire them. You're lucky if you can find an employer who understands the value of it. But this is what makes a "10x coder": the unpaid time spent lying awake in bed, sleepless, until you can untangle the real logic problems you'll have to turn into code the next day.

csomar 20 hours ago | parent [-]

That's not how real life works; you are thinking of a movie. Management will never let go of any power they have accumulated until the place is completely ransacked. The Soviet Union is a cautionary tale: a relatively modern, well-documented event.

noduerme 19 hours ago | parent [-]

I only work for companies where I have direct interaction with the owners. But I think that any business structure that begins to resemble a "soviet" type, where middle management accumulates all the power (and is scared of workers who have ideas) is inevitably going to collapse. If the way they try in the late 2020s to accumulate power is by replacing thoughtful coders with LLMs, they will collapse in a very dramatic, even catastrophic fashion. Which will be very funny to me. And it will result in their replacement, and the reinstatement of thoughtful code design.

A lot of garbage will have to be rewritten and a lot of poorly implemented logic re-thought. Again, I think a hard-learned lesson is in order, and it will be a great thing for our industry.

agoodusername63 a day ago | parent | prev | next [-]

I think there's still lots of room for huge jumps in many metrics. It feels like not too long ago that DeepSeek demonstrated there was value in essentially recycling (stealing, depending on your view) existing models into new ones to achieve 80% of what the industry had to offer for a fraction of the operating cost.

Researchers are still experimenting, I haven't given up hope yet that there will be multiple large discoveries that fundamentally change how we develop these LLMs.

I think I agree with the idea that current common strategies are beginning to scrape the bottom of the barrel though. We're starting to slow down a tad.

adamtaylor_13 a day ago | parent | prev | next [-]

That’s funny, my experience has been the exact opposite.

Claude Code has single-handedly 2-3xed my coding productivity. I haven’t even used Claude 4 yet, so I’m pretty excited to try it out.

But even trusty ol’ 3.7 is easily helping me put out 2-3x the amount of code I was before. And before anyone asks: yes, it’s all peer-reviewed and I read every single line.

It’s been an absolute game changer.

Also, to your point about most engineering being thinking: I can test 4-5 ideas in the time it took me to test a single idea in the past. And once you find the right idea, it 100% codes faster than you do.

runekaagaard 19 hours ago | parent [-]

Yeah remember when people were using Claude 3.7... so oldschool man

icpmacdo a day ago | parent | prev | next [-]

"It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase."

SWE-bench went from ~30-40% to ~70-80% this year.

elcritch a day ago | parent | next [-]

Yet despite this, all the LLMs I've tried struggle to scale beyond much more than a single module. They may be vastly improved on that test, but in real life they still struggle to stay coherent over larger projects and scales.

bckr 10 hours ago | parent | next [-]

> struggle to scale beyond much more than a single module

Yes. You must guide coding agents at the level of modules and above. In fact, you have to know good coding patterns and make these patterns explicit.

Claude 4 won’t use uv, pytest, pydantic, mypy, classes, small methods, and small files unless you tell it to.

Once you tell it to, it will do a fantastic job generating well-structured, type-checked Python.
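For instance, the kind of explicit guidance described above might yield output shaped roughly like this (a hypothetical sketch; the module and function names are illustrative, not anything an agent emits by default, and plain dataclasses stand in where the comment mentions pydantic models):

```python
# user_service.py -- the "small files, small typed functions" style you have
# to ask for explicitly; with pydantic you would subclass BaseModel instead.
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """A small, immutable, type-annotated record."""
    name: str
    email: str


def normalize_email(email: str) -> str:
    """One small function per concern, easy to test in isolation."""
    return email.strip().lower()


def make_user(name: str, email: str) -> User:
    """Validate and construct; raises ValueError on bad input."""
    normalized = normalize_email(email)
    if "@" not in normalized:
        raise ValueError(f"invalid email: {email!r}")
    return User(name=name, email=normalized)


# test_user_service.py -- the matching pytest-style test you would also ask for.
def test_make_user_normalizes_email() -> None:
    user = make_user("Ada", "  ADA@Example.COM ")
    assert user.email == "ada@example.com"
```

The point isn't this particular code; it's that the pattern (frozen records, small pure functions, a test per behavior) has to be named in the prompt before the agent reliably produces it.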

viraptor 21 hours ago | parent | prev [-]

Those are different kinds of issues. Improving the quality of individual actions is what we're seeing here. Then, for larger projects/contexts, the leaders will have to battle it out between improved agents, or actually move to something like RWKV and process the whole project in one go.

morsecodist 21 hours ago | parent [-]

They may be different kinds of issues but they are the issues that actually matter.

piperswe a day ago | parent | prev | next [-]

How much of that is because the models are optimizing specifically for SWE bench?

icpmacdo a day ago | parent [-]

Not that much, because it's getting better at all benchmarks.

keeeba 18 hours ago | parent | prev | next [-]

https://arxiv.org/abs/2309.08632

avs733 a day ago | parent | prev [-]

3% to 40% is a 13x improvement

40% to 80% is a 2x improvement

It’s not that the second leap isn’t impressive; it just doesn’t change your perspective on reality in the same way.
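A quick back-of-envelope makes the framing concrete, using the percentages quoted in this thread. The pass-rate ratio shrinks with each jump, but the same 40%->80% jump is a 3x reduction in failures (from 60% failing to 20%):

```python
# Back-of-envelope on the benchmark numbers quoted above.
def pass_ratio(old: float, new: float) -> float:
    """How many times higher the pass rate got."""
    return new / old


def error_reduction(old: float, new: float) -> float:
    """How many times smaller the failure rate got."""
    return (1 - old) / (1 - new)


print(pass_ratio(0.03, 0.40))       # ~13.3x: the perspective-changing jump
print(pass_ratio(0.40, 0.80))       # 2x on pass rate...
print(error_reduction(0.40, 0.80))  # ...but ~3x fewer failures (60% -> 20%)
```

Which framing matters depends on what you're doing: for "does this change my worldview", pass-rate multiples; for "how often do I clean up after the agent", failure-rate reduction.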

viraptor 21 hours ago | parent | next [-]

Maybe... It will be interesting to see the improvements now compared to other benchmarks. Is 80->90% going to be an incremental fix with minimal impact on the next benchmark (same work, but better), or is it going to be an overall 2x improvement on the remaining unsolved cases (a different approach tackling previously missed areas)?

It really depends on how that remaining improvement happens. We'll see soon enough, though: every benchmark nearing 90% is being replaced with something new, and SWE-bench Verified is almost dead now.

energy123 21 hours ago | parent | prev | next [-]

80% to 100% would be an even smaller improvement but arguably the most impressive and useful (assuming the benchmark isn't in the training data)

andyferris 21 hours ago | parent | prev [-]

I wouldn’t want to wait ages for Claude Code to fail 60% of the time.

A 20% risk seems more manageable, and the improvement speaks to better code and problem-solving skills all around.

hodgehog11 21 hours ago | parent | prev | next [-]

Under what metrics are you judging these improvements? If you're talking about benchmark scores, as others have pointed out, those are increasing at a regular rate (putting aside the occasional questionable training practice where the benchmark is in the training set). But most individuals seem to judge "order of magnitude jumps" by whether the model can solve a very specific set of their use cases to a given level of satisfaction or not. This is a highly nonlinear metric, so changes will always appear incremental until suddenly they aren't. Judging progress this way is alchemy, and leads only to hype cycles.

Every indication I've seen is that LLMs are continuing to improve, each fundamental limitation recognized is eventually overcome, and there are no meaningful signs of slowing down. Unlike prior statistical models which have fundamental limitations without solutions, I have not seen evidence to suggest that any particular programming task that can be achieved by humans cannot eventually be solvable by LLM variants. I'm not saying that they necessarily will be, of course, but I'd feel a lot more comfortable seeing evidence that they won't.

morsecodist 21 hours ago | parent [-]

I think it actually makes sense to trust your vibes more than benchmarks. The act of creating a benchmark is the hard part: if we had a perfect benchmark, AI problems would be trivially solvable. Benchmarks are meaningless on their own; they are supposed to be a proxy for actual usefulness.

I'm not sure what better metric there is than: can it do what I want? And for me, the ratio of yes to no on that hasn't changed much.

morsecodist 21 hours ago | parent | prev | next [-]

I agree on the diminishing returns and that the code doesn't cut it on its own. I really haven't noticed a significant shift in quality in a while. I disagree on the productivity though.

Even for something like a script to do some quick debugging or answering a question it's been a huge boon to my productivity. It's made me more ambitious and take on projects I wouldn't have otherwise.

I also don't really believe that workers are currently being replaced by LLMs. I have yet to see a system that comes anywhere close to replacing a worker. I think these layoffs are part of a trend that started before the LLM hype, and it's just a convenient narrative. I'm not saying that there will be no job loss as a result of LLMs; I'm just not convinced it's happening now.

csomar 20 hours ago | parent | prev | next [-]

> And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.

If the banking industry is any clue, they'll get a bailout from the government to prevent a "systemic collapse". There is a reason "everyone" is doing it, especially with these governments: you get to be cool, you don't risk missing out, and if it blows up, you let it blow up at the taxpayer's expense. The only real risk to this system is China, because they can now out-compete US industries.

sublimefire 16 hours ago | parent | prev | next [-]

There are a couple of ways LLMs are OK from the business perspective. Even if they are so-so, you can still write large amounts of mediocre code without needing to consume libraries. Think about GPL'd code: no need to worry about that, because one dev can rewrite those libraries into proprietary versions without licensing constraints. Another thing is that LLMs are OK for an average company with a few engineers that needs to ship mountains of code across platforms; they would make mistakes anyway, so LLMs shouldn't make things worse.

Horffupolde a day ago | parent | prev | next [-]

So you abandon university because you don't make order-of-magnitude progress between semesters? It's only clear in hindsight. Progress is logarithmic.

Davidzheng 21 hours ago | parent | prev | next [-]

Disagree. The marginal returns are more in places where the LLMs are near skill ceilings.
