Ifkaluva an hour ago

It’s quite common these days to treat an LLM as a policy in the RL sense: it takes the previous context as its “state”, and its task is to choose a continuation as its “action”. It gets a “reward” either from a reward model trained on human preferences, or from a verifiable source, such as passing test cases.
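A minimal sketch of that loop in Python (the names like `policy` and `verifiable_reward` are illustrative, not from any particular library):

    import random

    # "State": the context so far, as a list of tokens.
    # "Action": the next token the policy emits.
    VOCAB = ["2", "3", "4", "+", "="]

    def policy(state):
        # Stand-in for the LLM: a real policy would condition on
        # the full context; this toy one samples uniformly.
        return random.choice(VOCAB)

    def rollout(prompt, max_tokens):
        state = list(prompt)
        for _ in range(max_tokens):
            state.append(policy(state))  # take an "action"
        return state

    def verifiable_reward(state):
        # RLVR-style reward: 1 if the completion "passes the test"
        # (here, literally a correct arithmetic string), else 0.
        return 1.0 if "".join(state) == "2+2=4" else 0.0

    episode = rollout(["2", "+", "2", "="], max_tokens=1)
    print("".join(episode), verifiable_reward(episode))

An RLHF setup would swap `verifiable_reward` for a learned reward model scoring the completion.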

This framing has been in active use for several years, since it’s what enables RLHF and RLVR. RLHF itself is quite old; it was already used in InstructGPT, before the original ChatGPT.