chaeronanaut a day ago
> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!

This is false: reasoning models are rewarded or punished based on their performance at verifiable tasks, not on human feedback or next-token prediction.
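A minimal sketch of the "verifiable reward" idea being described (often called RLVR): the training signal comes from programmatically checking the model's final answer against a known ground truth, not from a learned human-preference model. The function and the `Answer:` marker convention below are illustrative assumptions, not any specific lab's implementation.

```python
def extract_final_answer(completion: str) -> str:
    # Take whatever follows the last "Answer:" marker (an assumed convention
    # for separating chain-of-thought from the checkable final answer).
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary outcome reward: 1.0 if the final answer matches, else 0.0.
    # The chain-of-thought tokens receive no direct reward; they matter
    # only insofar as they lead to a correct, checkable final answer.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

completion = "Reasoning: 12 * 7 = 84, plus 3 is 87. Answer: 87"
print(verifiable_reward(completion, "87"))  # -> 1.0
print(verifiable_reward(completion, "88"))  # -> 0.0
```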
Xelynega a day ago
How does that differ from a non-reasoning model rewarded/punished based on performance at verifiable tasks? What does CoT add that enables the reward/punishment?