impossiblefork 5 hours ago

But consider it like this: the model lives in a reward environment where it's tasked with outputting prescribed text or answering certain questions.

Instead of just outputting the answer, it generates non-output tokens, conditioned on which the probability of the answer that earned it rewards before is increased.
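
To make that concrete, here's a toy REINFORCE sketch of what I mean (just my own illustration, not anyone's actual training code): a tiny policy samples a hidden "thought" token, then an answer conditioned on it, and only the answer is checked against a target. The vocabulary size, the target token, and the lookup-table "model" are all made up for the example; the point is just that the reward gradient also flows through the thought tokens, so thoughts that tend to precede rewarded answers get more likely.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 8    # tiny token vocabulary (made up for the example)
    TARGET = 5   # the "correct" answer token for this toy task (made up)

    # Policy parameters: one distribution over hidden "thought" tokens,
    # and one distribution over answer tokens conditioned on the thought.
    thought_logits = np.zeros(VOCAB)
    answer_logits = np.zeros((VOCAB, VOCAB))  # row = thought, col = answer

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    lr = 0.5
    for step in range(3000):
        # Sample a hidden thought, then an answer conditioned on it.
        p_t = softmax(thought_logits)
        thought = rng.choice(VOCAB, p=p_t)
        p_a = softmax(answer_logits[thought])
        answer = rng.choice(VOCAB, p=p_a)

        reward = 1.0 if answer == TARGET else 0.0

        # REINFORCE: raise the log-prob of the whole sampled sequence,
        # thought token included, in proportion to the reward it earned.
        g_t = -p_t
        g_t[thought] += 1.0   # d log p(thought) / d logits
        g_a = -p_a
        g_a[answer] += 1.0    # d log p(answer | thought) / d logits
        thought_logits += lr * reward * g_t
        answer_logits[thought] += lr * reward * g_a

    best_thought = int(np.argmax(thought_logits))
    print("most likely thought token:", best_thought)
    print("P(correct answer | that thought):",
          softmax(answer_logits[best_thought])[TARGET])

In a real setup the "thoughts" are of course whole chain-of-thought passages and the policy is the LLM itself, with fancier estimators than plain REINFORCE, but the reward plumbing has the same shape.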

Is this not a sort of reasoning? It looks ahead at imagined things and tries to gauge what will get it the reward?