rgoulter 17 hours ago

A couple of typical comments about LLMs would be:

"This LLM is able to capably output useful snippets of code for Python. That's useful."

and

"I tried to get an LLM to perform a niche task with a niche language, it performed terribly."

I think the right synthesis is that there are some tasks LLMs are useful for and some they're not; practically, it helps to know which is which.

Or, if we trust that LLMs are useful for all tasks, then it's practically useful to know what they're not good at.

ygritte 17 hours ago | parent | next [-]

Even if that's true, they are still not reliable. The same question can produce different answers each time.

hhh 14 hours ago | parent | next [-]

This isn't really true when you control the stack, though, is it? If you set all of your parameters for reproducibility (e.g. temperature 0, same seed), the output should be the same as long as everything further down the stack is the same, no?
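
For illustration, a minimal sketch of what pinning those parameters might look like with the OpenAI Python client (the model name, prompt, and seed value are placeholders; OpenAI documents seed as best-effort, so even this does not fully guarantee identical output):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user",
                   "content": "Write a Python function that parses a CSV row."}],
        temperature=0,         # always prefer the most likely token
        seed=42,               # best-effort reproducibility, not a hard guarantee
    )
    print(resp.choices[0].message.content)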

imiric 13 hours ago | parent [-]

That's not a usable workaround. In most cases it doesn't actually produce full determinism[1].

And even if it did, a certain degree of non-determinism is actually desirable. The most probable tokens might not be correct, and randomness is partly responsible for what humans interpret as "creativity". Even hallucinations are desirable in some applications (art, entertainment, etc.).

[1]: https://medium.com/google-cloud/is-a-zero-temperature-determ...
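
As a rough illustration of the trade-off being described: samplers divide the model's scores (logits) by the temperature before turning them into probabilities, so temperature 0 collapses to always picking the top token, while higher temperatures let less likely (sometimes better, sometimes wrong) tokens through. The numbers below are toy values, not from any real model:

    import math, random

    def sample(logits, temperature, rng=random.Random(0)):
        """Temperature-scaled softmax sampling over a toy logit vector."""
        if temperature == 0:
            # Degenerate case: greedy decoding, always the argmax.
            return max(range(len(logits)), key=lambda i: logits[i])
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return rng.choices(range(len(logits)), weights=probs, k=1)[0]

    logits = [2.0, 1.5, 0.3]   # toy scores for three candidate tokens
    print(sample(logits, 0))   # always token 0
    print(sample(logits, 1.0)) # usually token 0, sometimes 1 or 2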

jowea 11 hours ago | parent | prev [-]

Is that critical? Doesn't it just need to be better than the alternative? Unless it's a safety-critical system.

imiric 16 hours ago | parent | prev [-]

> Or, if we trust that LLMs are useful for all tasks, then it's practically useful to know what they're not good at.

The thing is that there's no way to objectively measure this. Benchmarks are often gamed, and, as a sibling comment mentioned, the output is not stable.

Also, everyone has different criteria for what constitutes "good". To someone with little to no programming experience, LLMs can feel downright magical. Experienced programmers, or any domain expert for that matter, would be able to gauge the output quality much more accurately. Even among the experienced group, the bar varies: some are fine with overlooking certain issues, or don't bother checking the output at all, while others hold it to much higher standards.

The problem is when any issues that are pointed out get blamed on the user instead of the tool. Or even worse: when the issues are acknowledged but excused as "this is the way these tools work" [1,2]. It's blatant gaslighting that AI companies love to promote, for obvious reasons.

[1]: https://news.ycombinator.com/item?id=44483897#44485037

[2]: https://news.ycombinator.com/item?id=44483897#44485366

rgoulter 16 hours ago | parent [-]

> The thing is that there's no way to objectively measure this.

Sure. But isn't that a bit like one person preferring VSCode and another preferring Emacs? The first method of comparison I reach for isn't "what objective metrics do you have?" so much as "how do you use it?".

> > This is insulting to smart and capable engineers with many years of experience working with software.

> Experienced programmers, or any domain expert for that matter, would be able to gauge the output quality much more accurately.

My experience is that smart and capable engineers have varying opinions on things. "What their opinion is" is less interesting than "why they hold that opinion".

I would be surprised, though, if someone were to boast about their experience and skills, and claim they were unable to find any way to use LLMs effectively.