simianwords 3 days ago

This paper is complete nonsense. The specific prompt they used doesn't specify reasoning effort, which defaults to none.

    {
      "model": "gpt-5.2-2025-12-11",
      "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
      "input": "((((())))))",
      "temperature": 0
    }
> Lower reasoning effort

The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.

Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.

With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.

---

So in the paper, the model very likely used no reasoning tokens. (It only uses them if you ask for reasoning specifically in the prompt.) What is the point of such a paper? We already know that reasoning tokens are necessary.
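For comparison, here's what the same request might look like with reasoning effort set explicitly. This is a sketch only: the dict mirrors the request body quoted above, and "medium" is the level the quoted docs suggest experimenting with.

```python
import json

# Sketch: the same request as in the paper, but with reasoning.effort
# set explicitly instead of relying on the default of "none".
# Model id and field names are taken from the thread, not verified.
request = {
    "model": "gpt-5.2-2025-12-11",
    "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
    "input": "((((())))))",
    "reasoning": {"effort": "medium"},  # explicit, so reasoning tokens get generated
}

print(json.dumps(request, indent=2))
```

With this one extra field, `usage.output_tokens_details.reasoning_tokens` in the response should be nonzero, which is the whole crux of the complaint above.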

Edit: I actually ran the prompt and this was the response

    {
      "model": "gpt-5.2-2025-12-11",
      "output_text": "Yes",
      "reasoning": {
        "effort": "none",
        "summary": null
      },
      "usage": {
        "input_tokens": 26,
        "output_tokens": 5,
        "total_tokens": 31,
        "output_tokens_details": {
          "reasoning_tokens": 0
        }
      }
    }

So zero reasoning_tokens were used. This whole paper is kinda useless and misleading. Did this get peer reviewed or something?

Chobilet 3 days ago | parent [-]

I'm sure this comment was made in good faith, but most researchers would rightfully understand these intricacies, and this is likely intentional (as noted in the paper). At a quick glance, I can't say whether or not the paper has been peer reviewed (though that's unlikely, or still in process, given how recently it was published). In general, you'd find published papers also listed in a specific journal/conference (i.e. not just the archives, which anyone can submit to).

Additionally, many of us in the field of LLM research are curious to understand the boundaries and limitations of what these models are capable of. This paper isn't really meant as any sort of "gotcha"; rather, it serves as a possible basis for future work. Though with the caveat that I'm still digesting the paper myself.

simianwords 3 days ago | parent [-]

I'm asking: why use a thinking model without allowing it to reason? No one uses it that way.

>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).

Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
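For what it's worth, the check the paper quotes is mechanically trivial; a minimal sketch (the second test string is the one from the quoted passage):

```python
def is_balanced(s: str) -> bool:
    """Return True iff every '(' has a matching ')' and they nest properly."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with no open partner
                return False
    return depth == 0

print(is_balanced("((((()))))"))   # 5 opens, 5 closes: balanced
print(is_balanced("((((())))))"))  # the paper's string has one extra ')'
```

So the paper's string is unbalanced, and the "Yes" in the API response quoted above is wrong; the disagreement in this thread is only over whether that failure is meaningful when reasoning effort is left at none.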

Chobilet 2 days ago | parent [-]

"We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors."

I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (e.g. what I'm guessing you mean by "no reasoning" here). You'll also note that within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".

Which indicates to me that we don't need to use further "reasoning", given previous results/studies show a decline once our base has an error. To me, this seems like a fair assumption. Since this is an active field of research, though, and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.