I find that for more "intuitive" evaluations, reasoning tends to hurt more than it helps. In other words, if the model can already do a one-shot classification correctly, layering on a bunch of second-guessing just degrades performance.
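To make that concrete, here's a minimal sketch of the kind of A/B comparison I mean: the same classification eval run once with a direct-answer prompt and once with a think-step-by-step prompt. Everything here is a placeholder (the `call_model` hook, the prompts, the dataset); the point is the comparison structure, not the specifics.

```python
from typing import Callable

# Hypothetical prompt pair for a sentiment-classification eval.
DIRECT_PROMPT = (
    "Label the sentiment of this review as POSITIVE or NEGATIVE. "
    "Answer with one word.\n\n{text}"
)
COT_PROMPT = (
    "Label the sentiment of this review as POSITIVE or NEGATIVE. "
    "Think step by step, then give your final one-word answer "
    "on the last line.\n\n{text}"
)

def accuracy(prompt_template: str,
             dataset: list[tuple[str, str]],
             call_model: Callable[[str], str]) -> float:
    """Fraction of (text, gold_label) pairs where the model's final line
    contains the gold label. call_model wraps whatever API you actually use."""
    correct = 0
    for text, gold in dataset:
        reply = call_model(prompt_template.format(text=text))
        # Score only the last non-empty line, so the CoT variant is judged
        # on its final answer rather than its intermediate reasoning.
        lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        final = lines[-1].upper() if lines else ""
        correct += gold.upper() in final
    return correct / len(dataset)

# Usage: plug in your own model call and labeled examples, e.g.
#   direct_acc = accuracy(DIRECT_PROMPT, dataset, call_model)
#   cot_acc = accuracy(COT_PROMPT, dataset, call_model)
# On "intuitive" tasks I'd expect direct_acc >= cot_acc more often than not.
```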
This may change as our RL methods get better at properly rewarding correct partial traces and penalizing overthinking, but for the moment there's often a stark difference between the cases where a multi-step process improves the model's ability to reason through the context and the cases where it doesn't.
This is made more complicated (for human prompters and evaluators) by the fact that, as Anthropic has demonstrated, the text of a reasoning trace means something very different to the model than it does to the human reading it. The reasoning the model claims to be doing can sometimes be worlds away from its actual computations (e.g., how it uses helical structures to do addition [1]).
[1] https://openreview.net/pdf?id=CqViN4dQJk