withinboredom 5 days ago
Oh man, if you want to see a thinking model lose its mind... write a list of ten items and ask "what is the best of these nine items?"[1] I've seen "thinking models" go off the rails trying to deduce what to do when they are given ten items and asked for the best of nine.

[1]: In reality, subtle internal inconsistencies in the prompt can really confuse the model. It's an entertaining bug in AI pipelines, but it can end up costing you a ton of money.
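For pipelines where that kind of count mismatch could burn reasoning tokens, a cheap pre-flight check is one option. The following is only a rough sketch, assuming the prompt states its count in a phrase like "these nine items" or "following 9 items"; `stated_count` and `count_mismatch` are hypothetical helper names, not part of any library.

```python
import re
from typing import List, Optional

# Hypothetical lookup table for small number words (assumption: counts up to twelve).
_NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}

def stated_count(question: str) -> Optional[int]:
    """Return the count claimed in a phrase like '9 items' or 'nine items', if any."""
    match = re.search(r"\b(\d+|[a-z]+)\s+items\b", question.lower())
    if match is None:
        return None
    token = match.group(1)
    if token.isdigit():
        return int(token)
    # Unknown words (e.g. "these items") yield None rather than a guess.
    return _NUMBER_WORDS.get(token)

def count_mismatch(question: str, items: List[str]) -> Optional[str]:
    """Return a warning string when the claimed count differs from the list length."""
    claimed = stated_count(question)
    if claimed is None or claimed == len(items):
        return None
    return f"prompt says {claimed} items but the list has {len(items)}"
```

Calling `count_mismatch("what is the best of these nine items?", items)` with a ten-element list returns a warning string, so the pipeline can fix or reject the prompt before it ever reaches the model.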
irthomasthomas 5 days ago
Thank you. This is an excellent argument against using models with hidden CoT tokens (Claude, Gemini, GPT-5). You could end up paying for a huge number of hidden reasoning tokens that aren't useful, and the issue would be masked by the hidden CoT summaries.
cout 5 days ago
Can you elaborate on what it means for a model to "lose its mind"? I tried what you suggested and the response seemed reasonable-ish, for an unreasonable question.
Ghoelian 5 days ago
Unfortunately Claude Code seems a little too "smart" for that one. Its response started with "I notice you listed 10 frameworks, not 9."
commakozzi 4 days ago
I've been following the progress of LLMs since the first public release of GPT-3.5, and every single time someone posts one of these tests I check the AIs I'm using to see if it's repeatable. It NEVER is. Granted, I'm not using the API; I'm using the chat interface, with potentially different system prompting. Here's GPT-5's response:

me: which is the best of the following 9 items: 1. bear, 2. car. 3. plane, 4. house, 5. high-rise, 6. church, 7. boat, 8. tree, 9. truck, 10. duck.

GPT-5: Thought for 11s. House. It provides essential, lasting shelter and broad utility. Note: you listed 10 items, not 9.

Edit: I saw someone mention that the chat interface doesn't reproduce the results you get via the API.