These tests are looking increasingly like a waste of time.

The "intelligence" is clearly there now. Trying to measure it seems pointless. I can't shop for hammers at the hardware store and sort by the quality of finished products they would produce. That is clearly an insane ask, but that's approximately what is being pushed for with these models now.

Domain specificity (harness & environment) is where the magic happens next. I intentionally use a slightly less powerful model to help reveal weaknesses in how I've exposed the domain to the model. Having capability reserves available dramatically increases confidence around a project like this. If the customer starts to complain about some edge cases, I can crank it up to gpt5.5 for the targeted scenarios. If I'm already on 5.5 there's nowhere else to go. I'm up against the wall.
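A rough sketch of the capability-reserve idea, not anyone's actual setup: the tier names, the `ModelRouter` class, and the scenario labels are all hypothetical, and nothing here calls a real API. It only shows the routing decision, where everything runs on the default tier and only scenarios with known complaints get bumped to the reserve tier.

```python
from dataclasses import dataclass, field

# Hypothetical tier names, used purely for illustration.
DEFAULT_TIER = "mid-tier-model"
RESERVE_TIER = "top-tier-model"


@dataclass
class ModelRouter:
    """Route each scenario to the default tier unless it was escalated."""
    default: str = DEFAULT_TIER
    reserve: str = RESERVE_TIER
    escalated: set = field(default_factory=set)

    def escalate(self, scenario: str) -> None:
        # Spend the capability reserve only where complaints have come in.
        self.escalated.add(scenario)

    def pick(self, scenario: str) -> str:
        # Escalated scenarios get the reserve tier; everything else stays put.
        return self.reserve if scenario in self.escalated else self.default


router = ModelRouter()
router.escalate("invoice-edge-cases")      # hypothetical scenario label
print(router.pick("invoice-edge-cases"))   # top-tier-model
print(router.pick("routine-lookup"))       # mid-tier-model
```

If the default tier were already the top one, `escalate` would have nothing left to switch to, which is the "up against the wall" case.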

▲

gcgbarbosa 3 hours ago | parent | next [-]

"the intelligence is clearly there"

I wonder if I am using the same models as everyone else. To me, LLMs still give good answers 80% of the time, but the other 20% of the time they fail so miserably that it is obvious the "intelligence" is not there.

▲

coldtea 2 hours ago | parent | next [-]

It might be extra demand for rigor that's not equally applied to humans. One could argue that other coders in our teams, or even ourselves, often fail in "a miserable way", say about 20% of the time. But we block this out, or consider it "regular functioning", or just a one-off based on something we got wrong, "just a try" we redo, etc.

But when an LLM does it on an area we know, we notice and suddenly it's too much.

▲

girvo 22 minutes ago | parent | next [-]

> But when an LLM does it on an area we know, we notice and suddenly it's too much.

Well of course. The owners of the companies building this are constantly talking about it replacing us all. Why would it be surprising that it would then be held to a higher standard?

▲

coldtea 2 minutes ago | parent [-]

Because it doesn't need to match a higher standard to "replace us all". It's enough that it works on the same standard, or even a lesser one, but for cheaper, with no complaints, and 24/7.

▲

nibbleyou 35 minutes ago | parent | prev [-]

Because a human fails in a known way. If a human does not have expertise in domain X or tech Y, they will fail there and the expectation is that they will fail.

With an LLM you never know where it can fail. There is no domain expertise for an LLM. It can fail in a miserable way in the same domain where it previously worked spectacularly.

▲

21asdffdsa12 3 hours ago | parent | prev | next [-]

It really depends on the field you are in, the tasks you set, and how much of it was in the training set. A web developer will find it succeeding in all tasks, while some c++ exotic physics simulation developer will find it lacking.

The "works for me" tells you more about the field of the LLM reviewer than about the LLM.

▲

wolvesechoes 2 hours ago | parent [-]

> while some c++ exotic physics simulation developer will find it lacking

Can confirm, but I keep reading that I am holding it wrong.

▲

20k an hour ago | parent | next [-]

I've consistently tried to apply LLMs to physics problems and they're utterly useless. They'll just confidently lie, or blatantly plagiarise source materials.

The issue is that once you hit niche physics simulations there simply isn't any training data available, so their limitations become incredibly apparent. It's also problematic because a field itself will contain lots of wrong information (it's research!), and AI picks all this up uncritically.

I thought I'd give chatgpt a quick spin on my favourite question, which is "is the adm formalism strictly equivalent to general relativity", to which it consistently gives the wrong answer

>Ah, now you’re hitting the subtlety head-on—that’s exactly where the “strict equivalence” claim needs nuance. Let’s unpack this carefully.

I don't know how anyone can stand these tools. It's just an obnoxious glazing machine that consistently tells me I'm a genius.

Gemini gives a slightly more robust answer, but fails catastrophically on the question "is the bssn formalism numerically stable", where just about the entire answer is wrong from top to bottom. It certainly looks convincing. It's got all the right terminology. It manages to piece together the right set of words, but all the informational content is wrong, which isn't exactly a small problem.

I struggle to see how these tools are of any use

▲

sofixa an hour ago | parent | next [-]

That's why there are companies specialising in AI for physics, like Emmi AI (now part of Mistral). If BMW and Airbus go on stage to talk about how they're using it for their physics simulations, it's probably at least decent.

▲

otabdeveloper4 25 minutes ago | parent | prev [-]

> confidently lie, or blatantly plagiarise

Good enough for enterprise work tho. (Also the secret sauce to "holding LLMs right".)

▲

OtomotO an hour ago | parent | prev [-]

You're not. People are just using a hammer to build a shed and telling you it's surely good to dig a hole too.

▲

hodgehog11 an hour ago | parent | prev | next [-]

I get about the same success rate with my problems (scientific computing usually), but they're often _much_ easier to check than to write, so an 80% success rate becomes game-changing.
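A back-of-the-envelope sketch of why cheap checking changes the picture. It assumes each attempt passes the check independently with the same probability and that the check itself is reliable; both are simplifying assumptions of this example, not claims from the thread. The function names are made up for illustration.

```python
def prob_verified_within(p_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries passes the check."""
    return 1.0 - (1.0 - p_success) ** attempts


def expected_attempts(p_success: float) -> float:
    """Mean number of tries until the first pass (geometric distribution)."""
    return 1.0 / p_success


for k in (1, 2, 3):
    print(k, round(prob_verified_within(0.8, k), 3))
print(expected_attempts(0.8))
```

Under those assumptions, an 80% per-attempt rate means a verified answer within two tries about 96% of the time, at an average cost of 1.25 attempts. If failures cluster on particular problems rather than being independent, retries help much less.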

▲

digitaltrees 3 hours ago | parent | prev [-]

I agree. I feel like sonnet 4.6 is sufficient for almost everything. Beyond that level it feels like the orchestration is more important.

That being said, the models still surprise me daily with a broad range of hallucinations, a lack of epistemology or common sense, or an inability to follow instructions.

Today it was trying to get opus 4.8 to just follow a simple architectural pattern for controllers in a Rails app. It was like pulling teeth out of a shark.
