impossiblefork 5 hours ago

But consider it like this: the model lives in a reward environment where it's tasked with outputting prescribed text or answering certain questions.

Instead of just outputting the answer, it generates non-output tokens, conditioned on which the probability of the answer that earned it rewards before is increased.
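
To make that concrete, here's a toy REINFORCE sketch of what I mean (just my own illustration, not anyone's actual training code): a tiny policy samples a hidden "thought" token, then an answer conditioned on it, and only the answer is checked against a target. The vocabulary size, the target token, and the lookup-table "model" are all made up for the example; the point is just that the reward gradient also flows through the thought tokens, so thoughts that tend to precede rewarded answers get more likely.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 8    # tiny token vocabulary (made up for the example)
    TARGET = 5   # the "correct" answer token for this toy task (made up)

    # Policy parameters: one distribution over hidden "thought" tokens,
    # and one distribution over answer tokens conditioned on the thought.
    thought_logits = np.zeros(VOCAB)
    answer_logits = np.zeros((VOCAB, VOCAB))  # row = thought, col = answer

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    lr = 0.5
    for step in range(3000):
        # Sample a hidden thought, then an answer conditioned on it.
        p_t = softmax(thought_logits)
        thought = rng.choice(VOCAB, p=p_t)
        p_a = softmax(answer_logits[thought])
        answer = rng.choice(VOCAB, p=p_a)

        reward = 1.0 if answer == TARGET else 0.0

        # REINFORCE: raise the log-prob of the whole sampled sequence,
        # thought token included, in proportion to the reward it earned.
        g_t = -p_t
        g_t[thought] += 1.0   # d log p(thought) / d logits
        g_a = -p_a
        g_a[answer] += 1.0    # d log p(answer | thought) / d logits
        thought_logits += lr * reward * g_t
        answer_logits[thought] += lr * reward * g_a

    best_thought = int(np.argmax(thought_logits))
    print("most likely thought token:", best_thought)
    print("P(correct answer | that thought):",
          softmax(answer_logits[best_thought])[TARGET])

In a real setup the "thoughts" are of course whole chain-of-thought passages and the policy is the LLM itself, with fancier estimators than plain REINFORCE, but the reward plumbing has the same shape.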

Is this not a sort of reasoning? It looks ahead at imagined things and tries to gauge what will get it the reward?