Animats 19 hours ago

There may be an "LLM Winter" as people discover that LLMs can't be trusted to do anything. Look for frantic efforts by companies to offload responsibility for LLM mistakes onto consumers. We've got to have something that has solid "I don't know" and "I don't know how to do this" outputs. We're starting to see reports of LLM usage having negative value for programmers, even though they think it's helping. Too much effort goes into cleaning up LLM messes.

imiric 18 hours ago | parent | next [-]

> Look for frantic efforts by companies to offload responsibility for LLM mistakes onto consumers.

Not just by companies. We see this from enthusiastic consumers as well, on this very forum. Or it might just be astroturfing, it's hard to tell.

The mantra is that in order to extract value from LLMs, the user must have a certain level of knowledge and skill in how to use them. "Prompt engineering", now reframed as "context engineering", has become the practice that separates those who feel these tools waste more of their time than they save from those who feel they're many times more productive with them. The tools themselves are never the issue. Clearly it's the user who lacks skill.

This narrative permeates blog posts and discussion forums. It was recently reinforced by a misinterpretation of a METR study.

To be clear: using any tool to its full potential does require a certain skill level. What I'm objecting to is the blanket claim that people who don't find LLMs to be a net benefit to their workflow simply lack the skills to use them. This is insulting to smart and capable engineers with many years of experience working with software. LLMs are not some alien technology that requires a degree to use correctly. Understanding how they work, feeding them the right context, and being familiar with the related tools and concepts do not require an engineering specialization. Anyone claiming they do is trying to sell you something: either LLMs themselves, or the idea that they're more capable than those criticizing this technology.

rightbyte 15 hours ago | parent | next [-]

> Or it might just be astroturfing, it's hard to tell.

Compare the hype for commercial SaaS models to, say, DeepSeek. I think there is an insane amount of astroturfing.

simplyluke 5 hours ago | parent [-]

One of my recurring thoughts reading all kinds of social media posts over the past few years has been to wonder how many of the comments boosting <SPECIFIC NEW LLM RELEASE/TOOL> are being written by AI.

Formulaic, unspecific about results while making extraordinary claims, and always in the same upbeat tenor.

Culonavirus 4 hours ago | parent [-]

And then on top of that, you can't even reply to a post and call it astroturfing, because that's against the rules (at least it used to be).

rightbyte 4 hours ago | parent [-]

Making unfalsifiable claims about a poster doesn't work out very well anyway, I think. Like claiming someone is a troll.

dmbche 6 hours ago | parent | prev | next [-]

Simple thought I had reading this:

I've used a tool to do a task today. I used a suction sandblasting machine to remove corrosion from a part.

Without the tool, had I wanted to remove the corrosion, I would've spent all day (if not more) scraping it off with sandpaper (is that a tool too? With the skin of my hands, then?), tediously working away millimeter by millimeter.

With the machine, it took about 3 minutes. I needed 4-5 minutes of training to attain this level of expertise.

The worth of this machine is undeniable.

How is it that LLMs are not so undeniably effective? I keep hearing people tell me they will take everyone's job, but this looks like the first collective faceplant from the big tech companies.

(Maybe second after Meta's VR stuff)

tines 5 hours ago | parent [-]

The difference is that LLMs are not like any other tool. Reasoning by analogy doesn't work when the things being compared are sufficiently dissimilar.

For example, people try to compare this LLM tech with the automation of the car manufacturing industry. That analogy is a terrible one, because machines build better cars and are much more reliable than humans.

LLMs don't build better software, they build bad software faster.

Also, as a tool, LLMs discourage understanding in a way that no other tool does.

rgoulter 17 hours ago | parent | prev | next [-]

A couple of typical comments about LLMs would be:

"This LLM is able to capably output useful snippets of code for Python. That's useful."

and

"I tried to get an LLM to perform a niche task with a niche language, it performed terribly."

I think the right synthesis is that there are some tasks LLMs are good at and some they're not; practically, it pays to know which is which.

Or, if we trust that LLMs are useful for all tasks, then it's practically useful to know what they're not good at.

ygritte 17 hours ago | parent | next [-]

Even if that's true, they are still not reliable. The same question can produce different answers each time.

hhh 14 hours ago | parent | next [-]

This isn't really true when you control the stack, is it? If you set all of your parameters for reproducibility (e.g. temperature 0, same seed), the output should be the same as long as everything further down the stack is the same.
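
Roughly something like this, as a minimal sketch (assuming an OpenAI-compatible Python client; `seed` is only a best-effort hint and parameter names vary by provider):

    from openai import OpenAI  # assumes the official openai Python package

    client = OpenAI()

    def ask(prompt: str) -> str:
        # Pin everything we can: greedy-ish decoding and a fixed seed.
        # Even then, providers only promise best-effort determinism.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=42,
        )
        return resp.choices[0].message.content

    # Two identical calls *should* match, but batching, hardware and
    # silent model updates further down the stack can still differ.
    print(ask("Name one prime number.") == ask("Name one prime number."))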

imiric 12 hours ago | parent [-]

That's not a usable workaround. In most cases it doesn't actually produce full determinism[1].

And even if it did, a certain degree of non-determinism is actually desirable. The most probable tokens might not be correct, and randomness is partly responsible for what humans interpret as "creativity". Even hallucinations are desirable in some applications (art, entertainment, etc.).

[1]: https://medium.com/google-cloud/is-a-zero-temperature-determ...

jowea 11 hours ago | parent | prev [-]

Is that critical? Doesn't it just need to be better than the alternative? Unless it's a safety-critical system.

imiric 16 hours ago | parent | prev [-]

> Or, if we trust that LLMs are useful for all tasks, then it's practically useful to know what they're not good at.

The thing is that there's no way to objectively measure this. Benchmarks are often gamed, and like a sibling comment mentioned, the output is not stable.

Also, everyone has different criteria for what constitutes "good". To someone with little to no programming experience, LLMs would feel downright magical. Experienced programmers, or any domain expert for that matter, would be able to gauge the output quality much more accurately. Even among the experienced group, there are different levels of quality criteria. Some might be fine with overlooking certain issues, or not bother checking the output at all, while others have much higher standards of quality.

The problem is when any issues that are pointed out are blamed on the user instead of the tool. Or even worse: when the issues are acknowledged but excused with "this is just the way these tools work" [1, 2]. It's blatant gaslighting that AI companies love to promote, for obvious reasons.

[1]: https://news.ycombinator.com/item?id=44483897#44485037

[2]: https://news.ycombinator.com/item?id=44483897#44485366

rgoulter 15 hours ago | parent [-]

> The thing is that there's no way to objectively measure this.

Sure. But isn't that a bit like one person liking VSCode and another liking Emacs? The first comparison I'd reach for isn't "what objective metrics do you have?" but "how do you use it?"

> > This is insulting to smart and capable engineers with many years of experience working with software.

> Experienced programmers, or any domain expert for that matter, would be able to gauge the output quality much more accurately.

My experience is that smart and capable engineers have varying opinions on things. -- "What their opinion is" is less interesting than "why they have the opinion".

I would be surprised, though, if someone were to boast about their experience/skills, & claim they were unable to find any way to use LLMs effectively.

mumbisChungo 18 hours ago | parent | prev | next [-]

The more I learn about prompt engineering the more complex it seems to be, but perhaps I'm an idiot.

dmbche 5 hours ago | parent [-]

It's just iterating until you get what you want.

It's gonna seem complex if you don't know the subject and don't know how to do the thing without an LLM.

But really, it's just trying and trying until you get what you want.

16 hours ago | parent | prev | next [-]
[deleted]
cheevly 16 hours ago | parent | prev | next [-]

Unless you have automated fine-tuning pipelines that self-optimize models for your tasks and domains, you are not even close to utilizing LLMs to their potential. But stating that you don't need extensive, specialized skills is enough of a signal for most of us to know that offering you feedback would be fruitless. If you don't have the capacity by now to recognize the barrier to entry, experts are not going to take the time to share their solutions with someone unwilling to understand.

ygritte 17 hours ago | parent | prev | next [-]

The sad thing is that it seems to work. Lots of people are falling for the "you're holding it wrong" narrative.

AnimalMuppet 11 hours ago | parent | prev [-]

It's probably not astroturfing, or at least not all astroturfing. At least some software engineers tend to do this. We've seen it before, with Lisp, and then with Haskell: "It doesn't work for you? You just haven't tried it for long enough to become enlightened!" Enthusiastic supporters assume that if it was highly useful for them, it must be for everyone in all circumstances, and that anyone who disagrees just hasn't been enlightened yet.

keeda 17 hours ago | parent | prev | next [-]

People can't be trusted to do anything either, which is why we have guardrails and checks and balances and audits. That is why in software, for instance, we have code reviews and tests and monitoring and other best practices. That is probably also why LLMs have made the most headway in software development; we already know how to deal with unreliable workers that are humans and we can simply transfer that knowledge over.

As was discussed on a subthread on HN a few weeks ago, the key to developing successful LLM applications is going to be figuring out how to put in the necessary business-specific guardrails with a fallback to a human-in-the-loop.
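
As a rough sketch of that shape (illustration only; `ask_llm`, the validators, and `escalate_to_human` are hypothetical hooks, not any particular library):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Decision:
        output: str
        reviewer: str  # "guardrails" or "human"

    def run_with_guardrails(task: str,
                            ask_llm: Callable[[str], str],
                            validators: list[Callable[[str], bool]],
                            escalate_to_human: Callable[[str, str], str],
                            max_retries: int = 2) -> Decision:
        # Retry the LLM against business-specific checks; anything that
        # never passes goes to a person instead of out the door.
        draft = ""
        for _ in range(max_retries + 1):
            draft = ask_llm(task)
            if all(check(draft) for check in validators):
                return Decision(draft, reviewer="guardrails")
        return Decision(escalate_to_human(task, draft), reviewer="human")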

lmm 17 hours ago | parent [-]

> People can't be trusted to do anything either, which is why we have guardrails and checks and balances and audits. That is why in software, for instance, we have code reviews and tests and monitoring and other best practices. That is probably also why LLMs have made the most headway in software development; we already know how to deal with unreliable workers that are humans and we can simply transfer that knowledge over.

The difference is that humans eventually learn. We accept that someone who joins a team will be net-negative for the first few days, weeks, or even months. If they keep making the same mistakes that were picked out in their first code review, as LLMs do, eventually we fire them.

keeda 16 hours ago | parent [-]

LLMs may not learn on the fly (yet), but these days they do have some sort of memory that they automatically bring into context. It's probably just a summary loaded into the context window, but I've had dozens of conversations with ChatGPT over the years and it remembers my past discussions, interests and preferences. It has many times connected dots across conversations many months apart to intuit what I had in mind and proactively steer the discussion to where I wanted it to go.

Worst case, if they don't do this automatically, you can simply "teach" them by updating the prompt to watch for a specific mistake (similar to how we often add a test when we catch a bug).
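
For example, something as simple as keeping a running list of caught mistakes and prepending it to the system prompt (hypothetical file and function names, nothing provider-specific):

    LESSONS_FILE = "lessons.txt"  # grows like a test suite: one entry per caught bug

    def add_lesson(mistake: str) -> None:
        # e.g. add_lesson("Never use mutable default arguments.")
        with open(LESSONS_FILE, "a") as f:
            f.write(f"- {mistake}\n")

    def build_system_prompt(base_prompt: str) -> str:
        try:
            with open(LESSONS_FILE) as f:
                lessons = f.read()
        except FileNotFoundError:
            lessons = ""
        return f"{base_prompt}\n\nKnown past mistakes to avoid:\n{lessons}"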

But it need not even be that cumbersome. Even weaker models do surprisingly well with broad guidelines. Case in point: https://news.ycombinator.com/item?id=42150769

yahoozoo 3 hours ago | parent [-]

Yeah, the memory feature is just a summary of past conversations added to the system prompt.

Buttons840 12 hours ago | parent | prev | next [-]

We need to put the LLMs inside systems that ensure they can only do correct things.

Put an LLM on documentation or man pages. Tell the LLM to output a range of lines, and the system actually looks up those lines and quotes them. The overall effect is that the LLM can produce some free-form output but is expected to provide a citation to support its claims, and the quote can't be hallucinated, since the LLM only points at a line range; a plain old computer program does the actual quoting.
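
A minimal sketch of that idea (the "LINES start-end" citation format is an assumption; the point is just that the quoting step is ordinary code, not the model):

    import re

    def quote_citation(doc_lines: list[str], citation: str) -> str:
        # The LLM emits something like "LINES 120-134"; this function,
        # not the model, fetches the actual text. A malformed or
        # out-of-range citation simply fails instead of hallucinating.
        m = re.fullmatch(r"LINES (\d+)-(\d+)", citation.strip())
        if not m:
            raise ValueError(f"malformed citation: {citation!r}")
        start, end = int(m.group(1)), int(m.group(2))
        if not (1 <= start <= end <= len(doc_lines)):
            raise ValueError(f"citation out of range: {citation!r}")
        quoted = doc_lines[start - 1:end]
        return "\n".join(f"{start + i}: {line}" for i, line in enumerate(quoted))

    # Pair the model's free-form answer with quote_citation(man_page_lines, "LINES 120-134")
    # so the reader sees text that verifiably comes from the source.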

And we haven't seen LLMs integrated with type systems yet. There are very powerful type systems, like dependent types, that can prove things like "this function returns a sorted list of numbers", and the type system ensures that is ALWAYS true [0], at compile time. You have to write a lot of proof code to help the compiler do these checks, but if an LLM can write those proofs, we can trust they are correct, because only correct proofs will compile.

[0]: Or rather, almost always true. There's always the possibility of running out of memory or the power going out.
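
As a toy illustration of the principle in Lean 4 (using a trivially provable property rather than full sortedness, which needs more proof machinery, but the mechanism is the same: the return type carries a proposition the compiler must check):

    -- The return type bundles a value with a proof about it; any definition
    -- whose proof doesn't check is rejected at compile time.
    def double (n : Nat) : { m : Nat // m = n + n } :=
      ⟨n + n, rfl⟩  -- `rfl` proves the equation by definitional equality

    -- If an LLM wrote ⟨n + n + 1, rfl⟩ instead, the file simply wouldn't compile.
    #eval (double 21).val  -- 42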

NoGravitas 11 hours ago | parent | next [-]

I think that if LLMs have any future, it is this. The LLM will only be a user interface to a system that on the back end is deterministic and of consistent quality, i.e., a plain old computer program.

digianarchist 11 hours ago | parent | prev [-]

Are models capable of generating citations? Every time I've asked ChatGPT for citations, they either don't exist or are incorrect.

Buttons840 10 hours ago | parent [-]

They can't pull citations out of their own weights, but if you give them tools to look up man pages (possibly annotated with line numbers), they could cite the lines that support their claims.
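
Something along these lines, as a hedged sketch of the common function-calling shape (the `read_man_page` tool name and its behaviour are made up for illustration):

    # One tool the model is allowed to call; the lookup runs outside the model,
    # so the quoted lines come from the real man page, not from its weights.
    READ_MAN_PAGE_TOOL = {
        "type": "function",
        "function": {
            "name": "read_man_page",  # hypothetical tool name
            "description": "Return numbered lines start..end of a man page.",
            "parameters": {
                "type": "object",
                "properties": {
                    "page": {"type": "string", "description": "e.g. 'tar(1)'"},
                    "start": {"type": "integer"},
                    "end": {"type": "integer"},
                },
                "required": ["page", "start", "end"],
            },
        },
    }

    def read_man_page(page: str, start: int, end: int) -> str:
        # A real implementation would shell out to `man` and slice the
        # numbered lines; left as a stub here.
        raise NotImplementedError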

mtlmtlmtlmtl 18 hours ago | parent | prev [-]

Yeah, I can't wait for this slop-generation hype circlejerk to end either. But for people who don't care about quality (scammers, spammers, blogspam grifters, people trying to affect elections by poisoning the narrative, people shitting out crappy phone apps, videos, music, and "art" to grift some ad revenue), gen AI is already the perfect product. Once the people who do care wake up and realise gen AI is basically useless to them, the internet will already be dead. We'll be in a post-truth, post-art, post-skill, post-democracy world, and the only people whose lives will have meaningfully improved are some billionaires in California who added a few more billions to their net worth.

It's so depressing to watch so many smart people spend their considerable talents on the generation of utter garbage and the erosion of the social fabric of society.