jamincan 2 days ago

For what it's worth, when I ask ChatGPT 5, it gets the correct answer every time. The response varies, but the answer is always three.

ViewTrick1002 2 days ago | parent [-]

Now try a different language. My take is that this is hard RL tuning to fix these "gotchas", since the underlying model can't do it on its own.

OpenAI is working on ChatGPT the application and ecosystem. They have transitioned from model building to software engineering with RL tuning and integration of various services to solve the problems the model can't do on its own. Make it feel smart rather than be smart.

This means that as soon as you find a problem where you step out of the guided experience you get the raw model again which fails when encountering these "gotchas".

Edit - Here's an example where the English prompt gets a heavily RL-tuned experience, with a whole load of context added on how to solve the problem, while the Swedish prompt for the same word fails.

https://imgur.com/a/SlD84Ih

ACCount37 2 days ago | parent [-]

You can tell it "be careful about the tokenizer issues" in Swedish and see how that changes the behavior.

The only thing that this stupid test demonstrates is that LLM metacognitive skills are still lacking. Which shouldn't be a surprise to anyone. The only surprising thing is that they have metacognitive skills, despite the base model training doing very little to encourage their development.

lossyalgo a day ago | parent [-]

LLMs were not designed to count letters[0] since they work with tokens, so whatever trick they are now doing behind the scenes to handle this case can probably only handle this particular case. I wonder if it's now included in the system prompt. I asked ChatGPT and it said it's now using len(str) and some other Python scripts to do the counting, but who knows what's actually happening behind the scenes.

[0] https://arxiv.org/pdf/2502.16705
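For what it's worth, if the model really does delegate to Python, the counting itself is trivial once it operates on the raw string rather than on tokens. A minimal sketch (`count_letter` is an illustrative name, not anything confirmed from the actual system prompt or tooling):

```python
from collections import Counter

def count_letter(word: str, letter: str) -> int:
    # Operating on the raw string sidesteps tokenization entirely:
    # every character is individually visible, so the count is exact.
    return Counter(word)[letter]

print(count_letter("strawberry", "r"))  # 3
```

Any such tool call works for any word and any letter, which is why delegation is a more robust fix than RL-tuning the model on one famous gotcha.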

ACCount37 a day ago | parent [-]

There's no "trick behind the scenes" there. You can actually see the entire trick being performed right in front of you. You're just not paying attention.

That trick? The LLM has succeeded by spelling the entire word out letter by letter first.

It's much easier for an LLM to perform "tokenized word -> letters -> letter counts" than it is to perform "tokenized word -> letter counts" in one pass. But it doesn't know that! It copies human behavior from human text, and humans never had to deal with tokenizer issues in text!
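That two-stage pipeline can be sketched in plain Python, with a toy token split standing in for a real tokenizer (the split below is illustrative, not an actual BPE segmentation):

```python
from collections import Counter

# Toy tokenization: the model "sees" multi-character chunks, not letters.
tokens = ["straw", "berry"]

# Stage 1: spell the word out letter by letter.
letters = list("".join(tokens))
# -> ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

# Stage 2: count over the per-letter sequence.
counts = Counter(letters)
print(counts["r"])  # 3
```

Once the intermediate per-letter sequence exists in the context, counting is easy; the hard part for the model is knowing it should produce that intermediate step at all.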

You can either teach the LLM that explicitly, or just do RLVR on diverse tasks and hope it learns the tricks like this by itself.