Chobilet 3 days ago

I'm sure this comment was made in good faith, but most researchers would rightfully understand these intricacies, and this is likely intentional (as noted in the paper). At a quick glance, I cannot say whether or not the paper has been peer reviewed (though that's unlikely, or still in process, given how recently it was published). In general, you'd find published papers also listed in a specific journal/conference (i.e. not just the archives, which anyone can submit to).

Additionally, many of us in the field of researching LLMs are curious to understand the boundaries and limitations of what they are capable of. This paper isn't really meant as any sort of "gotcha"; rather, it serves as a possible basis for future work. Though with the caveat that I'm still digesting the paper myself.

simianwords 3 days ago | parent [-]

I'm asking: why use a thinking model without allowing it to reason? No one uses it that way.

>While LLMs appear extremely intelligent and capable of reasoning, they sometimes make mistakes that seem inconceivably foolish from a human perspective. For example, GPT-5.2 can implement complex fluid dynamics simulation code, yet it cannot even compute the parity of the short string 11000, cannot determine whether the parentheses in ((((()))))) are balanced, and makes calculation errors on 127 × 82 (Figure 1).

Why would they say it is capable of reasoning and then not allow it to reason in the experiment?
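For what it's worth, all three of the example tasks quoted above have mechanically checkable answers. A minimal Python sketch of the ground truth (the paper's exact prompts and scoring may differ):

```python
def parity(bits: str) -> str:
    """Parity of a bit string: even if it contains an even number of 1s."""
    return "even" if bits.count("1") % 2 == 0 else "odd"

def balanced(s: str) -> bool:
    """Check parenthesis balance with a running depth counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' closed with nothing open
                return False
    return depth == 0

print(parity("11000"))           # -> even (two 1s)
print(balanced("((((())))))"))   # -> False (five '(' vs six ')')
print(127 * 82)                  # -> 10414
```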

Chobilet 2 days ago | parent [-]

"We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors."

I'm again taking your responses in good faith, but the abstract answers your question about what they are trying to achieve. For any statistical significance, you'd want to point to a baseline comparison (which I'm guessing is what you mean by "no reasoning" here). You'll also note that within the paper, the author argues and cites that failing at the baseline step (e.g. multiplication) has shown "that error often adversely affects subsequent reasoning [38, 44]".

Which indicates to me that we don't need to use further "reasoning", given that previous results/studies show a decline once our base has an error. To me, this seems like a fair assumption. Though since this is an active field of research, and we are largely testing a black-box application, we can't say for certain. Further studies (like this one) will give researchers a better understanding of what is and isn't possible.