If it's only pretending to reason, then why does the CoT output improve performance on every single benchmark and test?