impossiblefork | 5 hours ago
But consider it like this: the model lives in a reward environment where it's tasked with outputting prescribed text or answering certain questions. Instead of just outputting the answer, it generates non-output tokens, conditioned on which the probability of the answer that earned it rewards before is increased. Is this not a sort of reasoning? It looks ahead at imagined things and tries to gauge what will get it the reward?
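
A minimal sketch of the point above, not anything from the comment itself: let a causal LM sample some intermediate "scratchpad" tokens first, then compare the log-probability it assigns to a target answer with and without that sampled prefix. The model name, prompt, and answer string are placeholders chosen for illustration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM would do
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def answer_logprob(context: str, answer: str) -> float:
        """Sum of log-probs the model assigns to `answer` given `context`."""
        ctx_ids = tok(context, return_tensors="pt").input_ids
        ans_ids = tok(answer, return_tensors="pt").input_ids
        ids = torch.cat([ctx_ids, ans_ids], dim=1)
        with torch.no_grad():
            logits = model(ids).logits
        # log-prob of each answer token, conditioned on everything before it
        logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        ans_positions = range(ctx_ids.shape[1] - 1, ids.shape[1] - 1)
        return sum(logprobs[0, pos, ids[0, pos + 1]].item() for pos in ans_positions)

    prompt = "Q: What is 17 * 24? A:"   # illustrative task
    answer = " 408"                     # the "rewarded" answer

    # Let the model emit non-output tokens first (a sampled scratchpad),
    # then see how the probability of the rewarded answer changes.
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        cot_ids = model.generate(prompt_ids, max_new_tokens=40, do_sample=True)
    cot_context = tok.decode(cot_ids[0], skip_special_tokens=True)

    print("log p(answer | prompt)              =", answer_logprob(prompt, answer))
    print("log p(answer | prompt + scratchpad) =", answer_logprob(cot_context, answer))

Whether the second number comes out higher depends entirely on the model and the sampled tokens; the sketch only shows the mechanism being described, i.e. that the answer's probability is evaluated conditional on the intermediate tokens the model chose to generate.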