Aren't the reasoning tokens themselves already surprisingly divergent from the models' actual thought processes? I've seen at least three separate studies on that subject.