serf 3 days ago

Smart caching fixes a lot of the issues there. If a fork is marked somehow as successful then presumably the cache lookup next time will be less painful/costly.

Of course that depends on how/where/when caching gets implemented, but it's not unsolvable for commonly occurring questions and answers.
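
A rough sketch of one way it could work (assuming fork results can be keyed by the question text; all names here are hypothetical, not from any real tool):

    import hashlib

    _fork_cache = {}  # question hash -> answer from a previously successful fork

    def answer(question, run_fork):
        """Return a cached answer if a fork already succeeded on this question."""
        key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
        if key in _fork_cache:
            return _fork_cache[key]      # cheap lookup for commonly asked questions
        result = run_fork(question)      # expensive path: spawn a fresh fork
        if result.get("success"):
            _fork_cache[key] = result["answer"]  # mark the fork as successful
        return result.get("answer")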

As for getting the SOTA questions wrong: we as humans would likely also go through an iterative feedback loop until we reach initial success and gain experience, too.

yahoozoo 3 days ago | parent [-]

Except LLMs aren’t humans. Why do we continue to say “well, yeah but humans <do this same imperfect thing>”? A major point/goal is for these to be better and less error prone than humans. It’s just coping at this point.

ben_w 3 days ago | parent | next [-]

That goes too far in the opposite direction.

Humans come with a broad range of skills and performance; LLMs are inside this range.

The fact that LLMs are not human, and the fact that the best humans beat them, is about as economically relevant as the fact that a ride-on lawnmower isn't human and that an athlete can (typically) outrace one: it comes down to what you're actually using them for.

zelphirkalt 3 days ago | parent [-]

But it is not merely the best humans. Any good developer is able to write better code, because by definition LLMs tend towards the mean, which means mediocre code, mostly from the GitHub repositories they were force-fed as training data.

They may excel at solving very narrow problems with decent results, like in that programming competition recently. But those are indeed very narrowly defined problems, and while they may solve them decently in limited time, that is roughly their overall limit, whereas a human, given more time, can excel to a much higher level.

It becomes a question of whether we want mediocre things that are not very extensible or maintainable, relying on the very thing that produced that mediocre code to maintain and extend them, or whether we want high-quality work.

For the latter, one would want to hire qualified people. Too bad, though, that hiring is broken at many companies and they don't recognize qualifications when they're right in front of them.

ben_w 3 days ago | parent [-]

I suspect we're not in strong disagreement here, because you recognise that not all humans are equal, and that some are indeed worse than LLMs. But:

> because by definition LLMs tend towards the mean

This part is false: the mean human can't write code at all. Also, as per your own point:

> They may excel at solving very narrow problems with decent results, like in that programming competition recently.

LLMs are often in the top decile of coding challenges, which are already limited to better-than-average developers. Now, these same models that get top-decile scores in challenges are still not in the top decile overall, because the role of software developer is much broader than just leetcode. But this still demonstrates the point: LLMs do not tend towards the mean.

> But those are indeed very narrowly defined problems, and while they may solve it decently in limited time, that is roughly their overall limit, while a human, given more time, can excel to a much higher level.

Except "code" is itself not narrowly-defined even despite what I just said. Even within one programming language, comprehension of the natural language task description is itself much harder and more general than any programming language, and both the programming language and all the libraries are described in a mixture of natural and formal language. Even just the ability to recognise if it's looking at examples of C or JavaScript is something it had to learn rather than being explicitly programmed with knowledge of.

Now sure, I will absolutely say that if the working definition of "intelligence" is about how few examples are needed to learn a new thing, then transformer models are "stupid". But, to a certain degree, they're able to make up for being very very stupid by being very very stupid very very quickly and very very cheaply: cheap enough and fast enough that when you do hit their skill limits, there are many cases where one can afford to boost them a noticeable degree, even though every n quality points you need to boost them by comes with roughly a 2^n increase in their cost in both time and money.

Not always, and it's an exponential cost per linear improvement, but often.
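
As a toy illustration of that shape (all numbers made up, purely to show the curve): if each extra quality point requires doubling the number of attempts, cost grows like 2^n while quality grows linearly.

    BASE_COST_USD = 0.01  # assumed cost of a single model call (made-up figure)

    def cost_for_boost(quality_points, base_cost=BASE_COST_USD):
        """Cost if every extra quality point doubles the number of attempts needed."""
        return base_cost * (2 ** quality_points)

    for points in range(1, 6):
        print(f"+{points} quality points -> ~${cost_for_boost(points):.2f}")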

zelphirkalt 3 days ago | parent [-]

> LLMs are often in the top decile of coding challenges, which are already limited to better-than-average developers. Now, these same models that get top decile scores in challenges are still not in the top decile overall because the role of software developer is much broader than just leetcode, but this still demonstrates the point: LLMs do not tend towards the mean.

Like I said: Very narrowly defined problems, yes they can excel at it.

But sometimes they don't even excel at that. Every couple of months I try to make LLMs write a specific function, but they didn't succeed in January and they didn't succeed a few weeks ago either. Basically zero progress in their ability to follow instructions regarding the design of the function. They cannot think, and as soon as something is rare in their training data, or even non-existent, they fail utterly. Even direct instructions like "do not make use of the following functions ..." they disregard, because they cannot help themselves, given the data they were trained on. And before you ask: I tried this on recent Qwen Coder, Mistral 3.1, ChatGPT, and someone else tried it for me on Claude-something. None of them did any better. All incapable of doing it. If the solution is in their training data at all, its signal is so weak that they never consider it.

This leads me to question how much shit code they introduce when solving a narrowly defined problem like those in coding competitions.

ben_w 2 days ago | parent [-]

> Like I said: Very narrowly defined problems, yes they can excel at it.

See next paragraph.

MattGaiser 3 days ago | parent | prev | next [-]

That isn't needed, as LLMs are way cheaper. Even if they never advance beyond a strong new grad, their cheapness is an enormous value-add on its own. GitHub will presently sell you 1,500 AI-generated PRs a month for $40. You used to have to pay a human $10K a month, even if it was small stuff.

All kinds of software worth as little as $10K a year are now worth building, as making and supporting them is so trivial.
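
Back-of-the-envelope with those figures (all of them claims in this thread, not measurements), the gap is stark:

    ai_monthly_usd = 40          # claimed price for 1,500 AI-generated PRs per month
    prs_per_month = 1500
    human_monthly_usd = 10_000   # claimed monthly cost of a human developer

    print(f"AI cost per PR:       ${ai_monthly_usd / prs_per_month:.3f}")            # ~$0.027
    print(f"Human vs AI, monthly: {human_monthly_usd / ai_monthly_usd:.0f}x the cost")  # 250x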

zelphirkalt 3 days ago | parent [-]

How many of these things have you developed and are maintaining this way?

Wowfunhappy 3 days ago | parent | prev [-]

Because humans take longer than Claude, and most of them want to be paid more than $200 per month. I don't just have access to another human at my beck and call at all hours.