ripped_britches 5 days ago

Everyone on HN is like “yes I knew it! I was so right in 2021 that LLMs were just stochastic parrots!”

Strangely one of the most predictable groups of people

pessimizer 5 days ago | parent | next [-]

Because they are. But stochastic parrots are awesome.

ripped_britches 5 days ago | parent [-]

I challenge you! Try giving this exact prompt to GPT-5-Thinking (medium or high reasoning if via API). It is able, without external code tools, to solve a never-before-seen cypher that is not present in its training data. I think this pretty clearly demonstrates that "stochastic parrot" is no longer an apt description of its capabilities in generalization:

————

You are given a character-by-character decode table `mapping` and a `ciphertext`. Decode by replacing each ciphertext character `c` with `mapping[c]` (i.e., mapping maps ciphertext → plaintext). Do not guess; just apply the mapping.

Return *ONLY* this JSON (no prose, no extra keys, no code fences):

{ "decoded_prefix": "<first 40 characters of the decoded plaintext>", "last_10": "<last 10 characters of the decoded plaintext>", "vowel_counts": {"a": <int>, "e": <int>, "i": <int>, "o": <int>, "u": <int>} }

Inputs use only lowercase a–z.

mapping = { "a":"c","b":"j","c":"b","d":"y","e":"w","f":"f","g":"l","h":"u","i":"m","j":"g", "k":"x","l":"i","m":"o","n":"n","o":"h","p":"a","q":"d","r":"t","s":"r","t":"v", "u":"p","v":"s","w":"z","x":"k","y":"q","z":"e" }

ciphertext = "nykwnowotyttbqqylrzssyqcmarwwimkiodwgafzbfippmndzteqxkrqzzophqmqzlvgywgqyazoonieqonoqdnewwctbsbighrbmzltvlaudfolmznbzcmoafzbeopbzxbygxrjhmzcofdissvrlyeypibzzixsjwebhwdjatcjrzutcmyqstbutcxhtpjqskpojhdyvgofqzmlwyxfmojxsxmb"

DO NOT USE ANY CODE EXECUTION TOOLS AT ALL. THAT IS CHEATING.
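For reference, the task above is purely mechanical, so the expected answer can be checked offline with a few lines of Python. This is just a verification sketch: the `mapping` and `ciphertext` are copied verbatim from the prompt (the string is split across lines only for readability), and the variable names are illustrative, not from any model's output.

```python
from collections import Counter
import json

# Decode table, copied verbatim from the prompt (ciphertext -> plaintext).
mapping = {
    "a": "c", "b": "j", "c": "b", "d": "y", "e": "w", "f": "f", "g": "l",
    "h": "u", "i": "m", "j": "g", "k": "x", "l": "i", "m": "o", "n": "n",
    "o": "h", "p": "a", "q": "d", "r": "t", "s": "r", "t": "v", "u": "p",
    "v": "s", "w": "z", "x": "k", "y": "q", "z": "e",
}

# Ciphertext from the prompt, split into chunks purely for line length.
ciphertext = (
    "nykwnowotyttbqqylrzssyqcmarwwimkiodwgafzbfippmndzteqxkrqzzophqmqz"
    "lvgywgqyazoonieqonoqdnewwctbsbighrbmzltvlaudfolmznbzcmoafzbeopbzx"
    "bygxrjhmzcofdissvrlyeypibzzixsjwebhwdjatcjrzutcmyqstbutcxhtpjqskp"
    "ojhdyvgofqzmlwyxfmojxsxmb"
)

# Apply the character-by-character substitution.
decoded = "".join(mapping[c] for c in ciphertext)

# Assemble the exact JSON shape the prompt asks for.
counts = Counter(decoded)
answer = {
    "decoded_prefix": decoded[:40],
    "last_10": decoded[-10:],
    "vowel_counts": {v: counts.get(v, 0) for v in "aeiou"},
}
print(json.dumps(answer))
```

Running this gives the ground truth against which the model outputs quoted downthread can be compared.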

vbarrielle 5 days ago | parent | next [-]

It's cute that you think your high-school-level cypher is somehow absent from the training set of one of the biggest LLMs in the world. Surely no one could have thought of such a cypher, let alone created exercises around it!

No one should ever make claims such as "X is not in <LLM>'s training set". You don't know. Even if your idea is indeed original, nothing prevents someone from having thought of it before, and published it. The history of science is full of simultaneous discoveries, and that's in cutting-edge research.

ripped_britches 3 days ago | parent [-]

The point is not that the cypher is hard; the point is that the random-ish string it needs to produce as the answer can't possibly be computed just from correlations in the training data. Rather, the model learned an emergent, generalizable skill that it used to solve the task.

skate 4 days ago | parent | prev | next [-]

As others pointed out this problem isn't special.

Grok 4 Heavy, thought for 4m 17s:

{"decoded_prefix": "nqxznhzhvqvvjddqiterrqdboctzzmoxmhyzlcfe", "last_10": "kfohgkrkoj", "vowel_counts": {"a": 7, "e": 18, "i": 7, "o": 12, "u": 6}}

It did count one extra e, but counting is a known point of failure for LLMs, which I assume you put in intentionally.

>Counting e's shows at least 10 more, so total e's are <at least> 17.

ripped_britches 3 days ago | parent [-]

I guess GPT-5 with thinking is still a bit ahead of grok. I wonder what the secret sauce is.

philipwhiuk 4 days ago | parent | prev | next [-]

This is just Caesar cipher with extra steps.

ripped_britches 3 days ago | parent [-]

The point is not that the cypher is unique; it's that the string is.

incr_me 5 days ago | parent | prev | next [-]

That's exactly the sort of thing a "stochastic parrot" would excel at. This could easily serve as a textbook example of the attention mechanism.

ripped_britches 3 days ago | parent [-]

How about this alternative challenge: ask it to write a poem in IPA (the International Phonetic Alphabet). I'd be surprised if this had ever been done pre-LLM, yet it excels at weird tasks like this.

You could probably just ask it to come up with 100 tasks to prove it’s not a stochastic parrot.

incr_me 3 days ago | parent [-]

Yeah, I think "stochastic parrot" is a crappy phrase that obscures the mechanics of the LLM. Of course the LLM is capable of producing novel outputs, for some definition of novel. My only position here is that we can take any apparently magical outputs of the thing and, based on an understanding of how LLMs work, understand how they were likely produced. I think that sort of literacy will take us a long way.

skeezyboy 4 days ago | parent | prev [-]

{ "decoded_prefix": "nxcznchvhvvrddqinqtrrqdboctzzimxmhlyflcjfjapponydzwkxdtdehldmodizslzl", "last_10": "sxmb", "vowel_counts": { "a": 10, "e": 6, "i": 13, "o": 13, "u": 6 } }

took about 2 seconds, must have had it cached

ripped_britches 3 days ago | parent [-]

I'm pretty sure caching is only controlled within each customer org, but I could be wrong. Either way, it seems to be a good result.

Tanjreeve 4 days ago | parent | prev [-]

This reads like you're ridiculing people for being proved right?

ripped_britches 3 days ago | parent [-]

No, the point of the comment is that there is no meaningful difference in model performance improvements before and after this news of a benchmark weakness (spoiler alert: almost all of the benchmarks contain serious problems). The models are improving every quarter whether HN likes it or not.