I find that for more "intuitive" evaluations, reasoning tends to hurt more than it helps. In other words, if the model can already do a one-shot classification correctly, layering on a bunch of second-guessing just degrades performance.
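To make that concrete, here's a minimal sketch of the kind of A/B comparison I mean: the same classification eval run once with a direct-answer prompt and once with a think-step-by-step prompt. Everything here is a placeholder (the `call_model` hook, the prompts, the dataset); the point is the comparison structure, not the specifics.

```python
from typing import Callable

# Hypothetical prompt pair for a sentiment-classification eval.
DIRECT_PROMPT = (
    "Label the sentiment of this review as POSITIVE or NEGATIVE. "
    "Answer with one word.\n\n{text}"
)
COT_PROMPT = (
    "Label the sentiment of this review as POSITIVE or NEGATIVE. "
    "Think step by step, then give your final one-word answer "
    "on the last line.\n\n{text}"
)

def accuracy(prompt_template: str,
             dataset: list[tuple[str, str]],
             call_model: Callable[[str], str]) -> float:
    """Fraction of (text, gold_label) pairs where the model's final line
    contains the gold label. call_model wraps whatever API you actually use."""
    correct = 0
    for text, gold in dataset:
        reply = call_model(prompt_template.format(text=text))
        # Score only the last non-empty line, so the CoT variant is judged
        # on its final answer rather than its intermediate reasoning.
        lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        final = lines[-1].upper() if lines else ""
        correct += gold.upper() in final
    return correct / len(dataset)

# Usage: plug in your own model call and labeled examples, e.g.
#   direct_acc = accuracy(DIRECT_PROMPT, dataset, call_model)
#   cot_acc = accuracy(COT_PROMPT, dataset, call_model)
# On "intuitive" tasks I'd expect direct_acc >= cot_acc more often than not.
```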
This may change as our RL methods get better at properly rewarding correct partial traces and penalizing overthinking, but for the moment there's often a stark difference between the cases where a multi-step process improves the model's ability to reason through the context and the cases where it doesn't.
This is made more complicated (for human prompters and evaluators) by the fact that, as Anthropic has demonstrated, the text of a reasoning trace means something very different to the model than it does to the human reading it. The reasoning the model claims to be doing can sometimes be worlds away from its actual computations (e.g., how it uses helical structures to do addition [1]).
[1] https://openreview.net/pdf?id=CqViN4dQJk