withinboredom 5 days ago

Oh man, if you want to see a thinking model lose its mind... write a list of ten items and ask "what is the best of these nine items?"[1]

I’ve seen "thinking models" go off the rails trying to work out what to do when they're given ten items but asked for the best of 9.

[1]: the reality of the situation is that subtle internal inconsistencies in the prompt can really confuse the model. It is an entertaining bug in AI pipelines, but it can end up costing you a ton of money.

irthomasthomas 5 days ago | parent | next [-]

Thank you. This is an excellent argument against using models with hidden COT tokens (Claude, Gemini, GPT-5). You could end up paying for a huge number of hidden reasoning tokens that aren't useful, and the issue is masked by the hidden COT summaries.
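
The cost side is at least measurable. Here is a minimal sketch of checking how many hidden reasoning tokens a contradictory prompt burns, assuming the OpenAI Python SDK and a reasoning model that reports reasoning tokens in the usage object; the model name and item list are placeholders, not from the original comment:

    # Minimal sketch: send a prompt with a 10-vs-9 inconsistency and
    # inspect how many hidden reasoning tokens it cost.
    # Assumes the OpenAI Python SDK; model name and items are placeholders.
    from openai import OpenAI

    client = OpenAI()

    items = ["bear", "car", "plane", "house", "high-rise",
             "church", "boat", "tree", "truck", "duck"]  # ten items
    prompt = ("Which is the best of the following 9 items? "
              + ", ".join(f"{i + 1}. {x}" for i, x in enumerate(items)))

    response = client.chat.completions.create(
        model="o3-mini",  # placeholder reasoning model
        messages=[{"role": "user", "content": prompt}],
    )

    print("answer:", response.choices[0].message.content)
    usage = response.usage
    print("completion tokens:", usage.completion_tokens)
    # Hidden chain-of-thought tokens you pay for but never see:
    print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)

When the provider only returns a COT summary, that usage field is the main signal that the model spent its budget arguing with the prompt rather than answering it.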

cout 5 days ago | parent | prev | next [-]

Can you elaborate on what it means for a model to "lose its mind"? I tried what you suggested and the response seemed reasonable-ish, for an unreasonable question.

withinboredom 5 days ago | parent [-]

The COT looks something like: “The user has provided a breakdown with each category having ten items, but then says the breakdown contains 5 items each. I see some have 5 and some have 10.” It then continues trying to work out which one is right, whether it is a mistake, how it should handle it, etc. It can literally spend thousands of tokens on this.

Ghoelian 5 days ago | parent | prev | next [-]

Unfortunately Claude Code seems a little too "smart" for that one. Its response started with "I notice you listed 10 frameworks, not 9."

withinboredom 5 days ago | parent [-]

You usually hit the pathological case when you have your own system prompt (i.e. over an API) forcing it to one-shot an action. The people who write the system prompts you use in chat have safeguards to detect "testing" prompts like this one and deal with them quickly.
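
A minimal sketch of that kind of one-shot setup, assuming the OpenAI Python SDK; the system prompt, model name, and framework list are invented for illustration:

    # Sketch of a one-shot API call with your own system prompt, where the
    # model gets a single turn and can't ask about the 10-vs-9 mismatch.
    # Assumes the OpenAI Python SDK; prompt contents are illustrative only.
    from openai import OpenAI

    client = OpenAI()

    SYSTEM_PROMPT = (
        "You are a ranking service. Reply with exactly one item name and "
        "nothing else. You get a single turn; never ask clarifying questions."
    )

    # Ten frameworks in the data, but the instructions say nine.
    user_prompt = "Pick the best of these 9 frameworks: " + ", ".join(
        ["Rails", "Django", "Laravel", "Spring", "Express",
         "Flask", "Phoenix", "Remix", "Next.js", "Svelte"]
    )

    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    print(response.choices[0].message.content)

With no second turn allowed, the model has to resolve the contradiction inside its reasoning instead of asking, which is where the token burn happens.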

commakozzi 4 days ago | parent | prev [-]

I've been following the progress of LLMs since the first public release of GPT-3.5, and every single time someone posts one of these tests I check the AIs I'm using to see if it's repeatable. It NEVER is. Granted, I'm not using the API; I'm using the chat interface, which may have different system prompting.

Here's GPT-5's response:

me: which is the best of the following 9 items: 1. bear, 2. car, 3. plane, 4. house, 5. high-rise, 6. church, 7. boat, 8. tree, 9. truck, 10. duck.

GPT-5: Thought for 11s. House. It provides essential, lasting shelter and broad utility. Note: you listed 10 items, not 9.

Edited: I saw someone mention that the chat interface doesn't reproduce the results you get via the API.

withinboredom 4 days ago | parent [-]

I've only seen this happen on API calls where:

1) you need to one-shot the result and chatting isn't an option, so the model is trying to figure out on its own how to accomplish its goal; and

2) the prompt has subtle inconsistencies. My example was mostly an illustration; I don't remember the exact details. Unfortunately, it has been too long and my logs are gone, so I can't give real examples.