▲ anuramat 5 days ago
Literally every single one? To not mess it up, they either have to spell the word l-i-k-e t-h-i-s in the output/CoT first (which depends on the tokenizer treating every letter as a separate token), or have the exact question in the training set, and all of that assumes the model can spell every token. Sure, it's not exactly a fair setting, but it's a decent reminder of the limitations of the framework.
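To make the tokenization point concrete, here is a minimal sketch using the open-source tiktoken library (an assumption; the thread does not name a tokenizer, and the exact splits vary by model). The only point is that a BPE tokenizer hands the model multi-letter chunks rather than letters, so letter counting is not a direct lookup:

```python
# Sketch of why letter counting is awkward for an LLM: the model operates on
# BPE tokens, not characters. tiktoken's cl100k_base encoding is used here as
# a stand-in; the exact split is an assumption and differs between models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "blueberry"
tokens = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in tokens]
print(pieces)  # multi-letter chunks, e.g. ['blue', 'berry'] (may differ)

# Spelling the word out l-i-k-e t-h-i-s usually forces each letter into its
# own token, which is the workaround the comment above describes.
spelled = "-".join(word)
print([enc.decode_single_token_bytes(t).decode("utf-8") for t in enc.encode(spelled)])
```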
▲ isaacremuant 4 days ago
ChatGPT. I test these prompts with ChatGPT and they work. I've also used Claude 4 Opus and it worked too. It's just weird how this gets repeated ad nauseam here, but I can't reproduce it with a "grab the latest model from a famous provider" approach.
▲ jedberg 4 days ago
I just asked ChatGPT "How many b's are in blueberry?". It instantly said "going to the deep thinking model" and then hung.
▲ isaacremuant 4 days ago
When I do it, it takes around 3 seconds: it shows "thinking longer for a better answer" and then answers 2. Again, I don't understand why it's seemingly so hard for me to reproduce these things. I understand the tokenisation constraints, but I feel the issue is overblown.
▲ anuramat 3 days ago
Opus 4.1:

> how many times does letter R appear in the word “blueberry”? do not spell the word letter by letter, just count

> Looking at the word “blueberry”, I can count the letter ‘r’ appearing 3 times. The R’s appear in positions 6, 7, and 8 of the word (consecutive r’s in “berry”).

<https://claude.ai/share/230b7d82-0747-4ab6-813e-5b1c82c43243>
▲ raincole 5 days ago
Effectively yes. Correctly no. https://claude.ai/share/dda533a3-6976-46fe-b317-5f9ce4121e76
▲ ceejayoz 4 days ago
GPT-5 can’t. https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226
▲ isaacremuant 4 days ago
I can't reproduce it, or similar ones. Why do you think that is?
▲ ceejayoz 4 days ago
Because it’s embarrassing and they manually patch it out every time like a game of Whack-a-Mole?
▲ isaacremuant 4 days ago
Except people use the same examples, like blueberry and strawberry, which were used months ago, as if they're still current. These models can also call Counter from Python's collections library, or whatever other algorithm. Or are we claiming it should be a pure LLM, as if that's what we use in the real world? I don't get it. I'm not one to hype up LLMs, since they're absolutely faulty, but the fixation on this example screams of lack of use.
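The Counter approach mentioned above is trivial once a model is allowed to run code. A minimal sketch, assuming a Python code-execution tool is available; count_letter is a hypothetical helper name, but collections.Counter is the real standard-library class being referred to:

```python
# Hypothetical helper a code-execution tool could run instead of "counting"
# letters from tokens; collections.Counter is the standard-library class the
# comment mentions.
from collections import Counter

def count_letter(word: str, letter: str) -> int:
    """Return the case-insensitive count of `letter` in `word`."""
    return Counter(word.lower())[letter.lower()]

print(count_letter("blueberry", "b"))  # 2
print(count_letter("blueberry", "r"))  # 2
print(count_letter("Perrier", "r"))    # 3
```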
▲ ceejayoz 4 days ago
It’s such a great example precisely for that reason: despite efforts, it comes back every time.
▲ insin 4 days ago
It's the most direct way to break the "magic computer" spell in users of all levels of understanding and ability. You stand it up next to marketing deliberately laden with keywords related to human cognition, intended to induce the reader to anthropomorphise the product, and it immediately makes the product look as silly as it truly is.

I work on the internal LLM chat app for an F100, so I see users who need that "oh!" moment daily. When this did the rounds again recently, I disabled our code execution tool, which would normally work around it, and the latest version of Claude, with "Thinking" toggled on, immediately got it wrong. It's perpetually current.
▲ alexjplant 4 days ago
"Mississippi" passed but "Perrier" failed for me:

> There are 2 letter "r" characters in "Perrier".
▲ isaacremuant 4 days ago
Thanks! I was finally able to reproduce one of these. OK, then I was wrong. I'll update my edit accordingly.
▲ isaacremuant 4 days ago
Update: after trying A LOT of examples, I did manage to reproduce one with the latest ChatGPT.
▲ 4 days ago
[deleted]