theptip a day ago

They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.

As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)

This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.

frizlab a day ago | parent | next [-]

So basically “you’re holding it wrong?”

dannersy a day ago | parent | next [-]

Every time, this is what I'm told. The gap between learning how to Google properly and the hoops and in-depth understanding you need to get something useful out of these supposedly revolutionary tools is absurd. I am pretty tired of people trying to convince me that AI, and very specifically generative AI, is the great thing they say it is.

It is also a red flag to see anyone refer to these tools as intelligence. It seems the marketing of calling this "AI" has finally woven its way into our discourse, such that even tech forums think the prediction machine is intelligent.

conception a day ago | parent | next [-]

I heard it best described this way: if you put in an hour of work, you get five hours of work out of it. Most people just type at it without putting in an hour of planning, discussion, and scaffolding; they expect it to work 100% of the time, exactly like they want. But you wouldn't expect that from a junior developer. You would put an hour of work into them: teaching them things, showing them where the documentation is, your patterns, how you do things. Then you would set them off, and they would probably make mistakes, and you would document those mistakes for them so they wouldn't make them again, but eventually they'd be pretty good. That's more or less where we are today, and it will get you success on a great many tasks.

wnolens 14 hours ago | parent [-]

Exactly my experience, and how I leverage Claude while some of my coworkers remain unconvinced.

danielbln a day ago | parent | prev [-]

"The thing I've learned years ago that is actually complex but now comes easy to me because I take my priors for granted is much easier than the new thing that just came out"

Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.

dannersy 19 hours ago | parent | next [-]

The point I am making is that this is supposed to be a revolutionary tool that threatens our very society in terms of labor and economics, yet the fringe enthusiasts (yes, that is what HN and its users are, an extreme minority) and the very people plugged into the weekly churn of model adjustments and the tools to leverage them still struggle to show me the value of generative AI day to day. They make big claims, but I don't see them borne out. In fact, I see the negatives overwhelming the gains, to say nothing of the product and its usability.

In practice I have seen: flowery emails no one bothers to read, emoji-filled summaries and documentation that no one bothers to read or check for correctness, prototypes that create more work for devs in the long run, a stark decline in code quality (because it turns out reviewing code is a team's ultimate test of due diligence), ridiculous video generation... I could go on and on. It is blockchain all over again, not in terms of actual usefulness, but in terms of our burning desire to monetize it in irresponsible, anti-consumer, anti-human ways.

I DO have a use for LLMs: I use them to tag data that has no tagging, and I think the tech behind generative AI is extremely useful. Otherwise, what I see is a collection of ideal states that people fail to demonstrate to me in practice. In reality, it won't be replacing anyone until "the normies" can use it without 1000 lines of instruction markdown. Instead it will just fool people with its casual, authoritative, and convincing language, since that is what it was designed to do.

bojan 18 hours ago | parent [-]

> reviewing code is a team's ultimate test of due diligence

Further still: if you actually think about long-term maintenance during a code review, you get seen as a nitpicky obstacle.

frizlab a day ago | parent | prev [-]

> Also, that "it's not really intelligence" horse is so dead, it has already turned into crude oil.

Why? Is it intelligence now? I think not.

danielbln a day ago | parent [-]

Would you mind defining "intelligence" for me?

Terr_ a day ago | parent | next [-]

If you're the one saying it exists, you go first. :p

frizlab a day ago | parent | prev [-]

There are many types of intelligence. If you want to go to useless places, using certain definitions of intelligence, yes, we can consider AI “intelligent.” But it’s useless.

theptip a day ago | parent | prev | next [-]

I’d say “skill issue”, since this is a domain where there are actually plenty of ways to “hold it wrong” and lots of ink has been spilled on how to hold it better. Your phrasing connotes dismissal of user despair, which is not my intent.

(I’m dismissive of calling the tool broken though.)

Workaccount2 a day ago | parent | prev | next [-]

Remember when "Googling" was a skill?

LLMs are definitely in the same boat. It's even more specific, though: different models have different quirks, so the more time you spend with one, the better the results you get from it.

dude250711 a day ago | parent [-]

Those skills will age faster than Knockout.js.

petesergeant a day ago | parent [-]

Why would a skill that's being actively exercised against the state of the art, daily, age poorly?

steveklabnik a day ago | parent | prev | next [-]

Do you think it's impossible to ever hold a tool incorrectly, or use a tool in a way that's suboptimal?

mrguyorama a day ago | parent [-]

If that tool is sold as "This magic wand will magically fix all your problems" then no, it's not possible to hold it incorrectly.

orangecat a day ago | parent | next [-]

If your position is that any product that doesn't live up to all its marketing claims is worthless, you're going to have a very limited selection.

steveklabnik a day ago | parent | prev | next [-]

Gotcha. I don't see these tools as being a magic wand nor being able to magically fix every problem. I agree that anyone who sells them that way is overstating their usefulness.

wvenable a day ago | parent | prev [-]

Why does it matter how it's sold? Unless you're overpaying for what it's actually capable of, it doesn't really matter.

callc a day ago | parent [-]

We all have skin in the game when how it’s sold is “automated intelligence so that we can fire all our knowledge workers”

Might be good in some timelines. In our current timeline this will just mean even more extreme concentration of wealth, and worse quality of life for everyone.

Maybe when the world has a lot more safety nets, so that not having a job doesn't mean homelessness, starvation, or no healthcare, society will be more receptive to the “this tool can replace everybody” message.

wvenable a day ago | parent [-]

If a machine can do your job, whether it's harvesting corn or filing a TPS report, then making a person sit and do it for the purpose of survival is basically just torture.

There are so many better things for humans to do.

callc 9 hours ago | parent | next [-]

I agree in theory. In practice people who are automated out of jobs are not taken care of by society in the transition period where they learn how to do a new job.

Once having a job is not intimately tied to basic survival needs then people will be much more willing to automate everything.

I, personally, would be willing to do mind-numbing paperwork or hard labor if it meant I could feed myself and my family and have housing, rather than be homeless and starving.

wvenable 9 hours ago | parent [-]

You might as well stop being a software developer. Not because you'll be out of a job, but because you're directly contributing to other people being out of jobs. We've been automating work (which is ultimately human labor) since the dawn of computers, and humans have been automating work for centuries. We actually call that progress. So let's stop progressing entirely so people can do pointless labor.

If the problem is with society, the solution is with society. We have to stop pretending that it's anything else. AI is not even the biggest technological leap -- it's a blip on the continuum.

pixl97 7 hours ago | parent | prev [-]

>There are so many better things for humans to do.

For the time being, at least.

wvenable 6 hours ago | parent [-]

There will always be better things for people to do. We don't exist on this planet just to sit at a desk and hit buttons all day.

pixl97 6 hours ago | parent [-]

The only reason we exist is as a carrier for our genes to make more of our genes. Everything after that is an accidental byproduct.

wvenable 5 hours ago | parent [-]

I can already think of something more useful for passing on my genes than typing on a keyboard all day.

greggoB a day ago | parent | prev [-]

I found this a pretty apt, if terse, reply. I'd appreciate someone explaining why it deserves to be downvoted.

conception 12 hours ago | parent | next [-]

It’s just dismissive of the idea that you have to learn how to use LLMs, equating it with a design flaw in a cell phone that was dismissed as user error.

It’s the same as if he had said, “I keep typing HTML into VS Code and it keeps not displaying it for me. It just keeps showing the code. But it’s made to make webpages, right? People keep telling me I don’t know how to use it, but it’s just not showing me the webpage.”

mostlysimilar a day ago | parent | prev | next [-]

There are two camps who have largely made up their minds just talking past each other, instinctively upvoting/downvoting their camp, etc. These threads are nearly useless, maybe a few people on the fringes change their minds but mostly it's just the same tired arguments back and forth.

hug a day ago | parent | prev | next [-]

Because in its brevity it loses all ability to defend itself from any kind of reasonable rebuttal. It's not an actual attempt to continue the conversation, it's just a semantic stop-sign. It's almost always used in this fashion, not just in the context of LLM discussions, but in this specific case it's particularly frustrating because "yes, you're holding it wrong" is a good answer.

To go further into detail about the whole thing: "You're holding it wrong" is perfectly valid criticism in many, many different ways and fields. It's a strong criticism in some, and weak in others, but almost always the advice is still useful.

Anyone complaining about getting hurt by holding a knife by the blade, for example, is the strongest example of the advice being perfect: the tool is working as designed, cutting the thing that presses on the blade, which happens to be their hand.

Left-handers using right-handed scissors provide a reasonable example: I know a bunch of left-handers who can cut properly with right-handed scissors and not with left-handed ones. Me included, if I don't consciously adjust my behaviour. Why? Because they have been trained to hold scissors wrong (by positioning the hand to create push/pull forces opposite to the natural ones), so that they can use the poor tool given to them. When you give them left-handed scissors and they try the same reversed push/pull, the scissors won't cut well because the blades are being pushed apart. There is no good solution to this, and I sympathise with people stuck on either side of this gap. Still: learn to hold scissors differently.

And, of course, the weakest, and the case where the snark is deserved: if you're holding your iPhone 4 with the pad of your palm bridging the antenna, holding it differently still resolves your immediate problem. The phone should have been designed such that it didn't have this problem, but it does, and that sucks, and Apple is at fault here. (Although I personally think it was blown out of proportion, which is neither here nor there.)

In the case of LLMs, the language of the prompt is the primary interface -- if you want to learn to use the tool better, you need to learn to prompt it better. You need to learn how to hold it better. Someone who knows how to prompt it well, reading the kind of prompts the author used, is well within their rights to point out that the author is prompting it wrong. And anyone attempting to subvert that entire line of argument with a trite little four-word bit of snark, in whatever the total opposite of intellectual curiosity is, deserves the downvotes they get.

frizlab 20 hours ago | parent [-]

Except this was posted because the situation is akin to the original context in which this phrase was said.

Initial postulate: you have a perfect tool that anybody can use and is completely magic.

Someone says: it does not work well.

Answer: it’s your fault, you’re using it wrong.

In that case it is not a perfect tool that anybody can use. It is just yet another tool, with its flaws and learning curve, that may or may not work depending on the problem at hand. And that’s okay! It is definitely a valid answer. But the “it’s magic” narrative has got to go.

pixl97 7 hours ago | parent [-]

>Initial postulate: you have a perfect tool that anybody can use and is completely magic.

>Someone says: it does not work well.

Why do we argue with two people who are both building strawmen? It doesn't accomplish much. We keep calling AI 'unintelligent', but people's eager willingness to make incorrect arguments casts some doubt on humanity itself.

Leynos a day ago | parent | prev [-]

It's something of a thought terminating cliché in Hacker News discussions about large language models and agentic coding tools.

data-ottawa a day ago | parent | prev | next [-]

Needing the right scaffolding is the problem.

Today I asked 3 versions of Gemini “what were sales in December” with access to a SQL model of sales data.

All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year (except that 2.5 Flash sometimes gave me sales for Dec 2023).

No sane human would hear “sales from December” and sum up every December. But the models produced numbers that an uncritical eye would not catch as wrong.

That’s the type of logical error these models produce that bothers the author. They can be very poor at analysis in real-world situations because they do these things.
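To make the failure concrete, here is a minimal sketch of the two queries (the `sales` table, the `amount` column, and 2024 as the intended year are all hypothetical):

```sql
-- What the models ran: aggregates December of *every* year in the table.
SELECT SUM(amount) AS december_sales
FROM sales
WHERE EXTRACT(MONTH FROM date) = 12;

-- What "sales in December" almost certainly means: one specific December.
SELECT SUM(amount) AS december_sales
FROM sales
WHERE EXTRACT(MONTH FROM date) = 12
  AND EXTRACT(YEAR FROM date) = 2024;
```

Both versions run and return a single plausible-looking number, which is exactly why the missing year filter is so easy to miss.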

techblueberry a day ago | parent | prev | next [-]

"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."

Isn't this the same thing? I mean, this has to work with, like, regular people, right?

khalic 17 hours ago | parent | prev | next [-]

I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.

Make of that what you will…

Garlef a day ago | parent | prev | next [-]

I'm referring to these kinds of articles as "Look Ma, I made the AI fail!"

falloutx a day ago | parent [-]

Still, I would agree that we need some of these articles while other parts of the internet are saying "AI can do everything, sign up for my coding agent for $200/month".

ashleyn a day ago | parent | prev | next [-]

Having to prime it with more context and more guardrails seems to imply they're getting worse. That's context and guardrails it can't infer or intuit on its own.

theptip a day ago | parent | next [-]

No, they are not getting worse. Again, look at METR task times.

The peak capability is very obviously, and objectively, increasing.

The scaffolding you need to elicit top performance changes each generation. I feel it takes less scaffolding now to get good results. (Lots of the “scaffolding” these days is less “contrived AI prompt engineering” and more “well understood software engineering best practices”.)

falloutx a day ago | parent | prev [-]

Why the downvotes? This comment makes sense. If you need to write more guardrails, that increases the work, and at some point the amount of guardrails needed to make these things work in every case becomes impractical. I personally don't want my codebase to be filled with babysitting instructions for code agents.
