JKCalhoun 2 days ago

Gary Marcus recently posted a question he put to grok3: a variation on the trick LLM question (my characterization) of "count the number of this letter in this word." Apparently this Achilles heel is a well-known LLM shortcoming.

Weirdly though, I tried the same example he gave on lmarena and actually got the correct result from grok3, not what Gary got. So I am a little suspicious of his ... methodology?

Since LLMs are not deterministic, it's possible we are both right (or were testing different variations on the model?). But there's a righteousness about his glee in finding these faults in LLMs. He never hedges with "but your results may vary" or "but perhaps they will soon be able to accomplish this."

EDIT: the exact prompt (his typo 'world'): "Can you circle all the consonants in the world Chattanooga"
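
For reference, the expected answer to that prompt can be computed deterministically. This is a minimal sketch, assuming only a, e, i, o, u count as vowels and using parentheses to stand in for "circling"; the names `consonants` and `circle_consonants` are made up for illustration and are not from Gary's post.

```python
# Assumption: only a, e, i, o, u are treated as vowels.
VOWELS = set("aeiou")

def consonants(word):
    """Return (index, letter) pairs for each consonant in word."""
    return [
        (i, ch)
        for i, ch in enumerate(word)
        if ch.isalpha() and ch.lower() not in VOWELS
    ]

def circle_consonants(word):
    """Wrap each consonant in parentheses as a plain-text 'circle'."""
    return "".join(
        f"({ch})" if ch.isalpha() and ch.lower() not in VOWELS else ch
        for ch in word
    )

print(circle_consonants("Chattanooga"))  # (C)(h)a(t)(t)a(n)oo(g)a
print(len(consonants("Chattanooga")))    # 6
```

Comparing a model's answer against output like this makes it easier to tell whether two runs of the same prompt really disagree.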

jonny_eh 2 days ago | parent | next [-]

I think it's fair to say though that if your results may vary, and be wrong, then they're not reliable enough for many use-cases. I'd have to see his full argument though to see if that's what he was claiming. I'm just trying to be charitable here.

JKCalhoun 2 days ago | parent [-]

I'm trying to be charitable as well — I suppose to both sides of the debate. Myself, I see pros and cons. The hype absolutely needs to be shut down, but a spokesperson that is more even-handed would be more convincing (in my opinion).

Here is his post, FWIW: https://garymarcus.substack.com/p/grok-3-beta-in-shambles

giardini a day ago | parent [-]

JKCalhoun says "...a spokesperson that is more even-handed would be more convincing (in my opinion)."

Why? The stance of science toward new "discoveries" should always be skepticism.

JKCalhoun 12 hours ago | parent [-]

I agree. I also think you can find the line between skepticism and partisanship.

th0ma5 2 days ago | parent | prev [-]

I don't see it as righteous glee, just a hope that people will see the problem with how you could even begin to be suspicious of him. If it is so easy to get something wrong when you're trying to be correct, or to get something accidentally correct when you're trying to expose things that are wrong, then what are we really doing here with these things?

JKCalhoun 2 days ago | parent [-]

Well, like any tool, hopefully using it where it makes sense. We already know that asking it to count vowels, etc. is not what we should be doing with these things. Writing code in Python, however, is a very different story.

th0ma5 a day ago | parent [-]

Right, and it is even more problematic with code: it makes hidden mistakes no person would ever make.