The thinking models are additionally trained with reinforcement learning to produce chain of thought reasoning