"the intelligence is clearly there"

I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail in such a miserable way that it's obvious the "intelligence" is not there.

coldtea 2 hours ago | parent | next [-]

It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.

But when an LLM does it on an area we know, we notice and suddenly it's too much.

girvo 21 minutes ago | parent | next [-]

> But when an LLM does it on an area we know, we notice and suddenly it's too much.

Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?

coldtea a few seconds ago | parent [-]

Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.

nibbleyou 34 minutes ago | parent | prev [-]

Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.

With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain where it previously worked spectacularly.

21asdffdsa12 3 hours ago | parent | prev | next [-]

It really depends on the field you are in, the tasks you set, and how much of it was in the training set. A web developer will find it succeeding in all tasks, while some C++ exotic physics simulation developer will find it lacking.

The "works for me" tells you more about the field of the LLM reviewer than about the LLM.

wolvesechoes 2 hours ago | parent [-]

> while some C++ exotic physics simulation developer will find it lacking

Can confirm, but I always read that I am holding it wrong.

20k an hour ago | parent | next [-]

I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials.

The issue is that once you hit niche physics simulations there simply isn't any training data available, so their limitations become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically.

I thought I'd give ChatGPT a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer.

> Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.

I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that consistently tells me I'm a genius.

Gemini gives a slightly more robust answer, but fails catastrophically for the question "is the bssn formalism numerically stable", where just about the entire answer is completely wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem.

I struggle to see how these tools are of any use

sofixa an hour ago | parent | next [-]

That's why there are companies specialising in AI for physics, like Emmi AI (now part of Mistral). If BMW and Airbus go on stage to talk about how they're using it for their physics simulations, it's probably at least decent.

otabdeveloper4 24 minutes ago | parent | prev [-]

> confidently lie, or blatantly plagiarise

Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)

OtomotO an hour ago | parent | prev [-]

You're not. People are just using a hammer to build a shed and telling you it's surely good to dig a hole too.

hodgehog11 an hour ago | parent | prev | next [-]

I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
