alcinos | 6 days ago
> We've just only started RL training LLMs

That's just factually wrong. Even the original ChatGPT model (based on GPT-3.5, released in 2022) was trained with RL (specifically RLHF).
prasoon2211 | 6 days ago
RLHF is not the "RL" the parent is posting about. RLHF is specifically human-driven reward (subjective, doesn't scale, doesn't improve the model's "intelligence", just tweaks behavior), which is why the labs have started calling it post-training rather than RLHF. True RL is where you set up an environment in which an agent can "discover" solutions to problems by iterating against some kind of verifiable reward, AND the entire space of outcomes is theoretically largely explorable by the agent. Math and coding have proven amenable to this type of RL so far.
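To make the distinction concrete, here is a minimal toy sketch of the verifiable-reward idea: the reward is a program that checks the answer, not a human preference label. Everything here (sample_answer, verifiable_reward, the crude bias update) is a hypothetical stand-in; a real setup would sample completions from an LLM and update it with a policy-gradient method such as PPO or GRPO.

```python
import random
import re

# Toy task with a programmatically checkable answer.
PROBLEM = "What is 17 * 24?"
GROUND_TRUTH = "408"

def sample_answer(success_bias: float) -> str:
    """Stub policy: mostly wrong answers, occasionally the right one.
    A real setup would sample completions from an LLM."""
    if random.random() < success_bias:
        return GROUND_TRUTH
    return str(random.randint(300, 500))

def verifiable_reward(answer: str) -> float:
    """Binary, machine-checkable reward: 1 if the extracted answer
    matches ground truth, else 0. No human in the loop."""
    match = re.search(r"\d+", answer)
    return 1.0 if match and match.group() == GROUND_TRUTH else 0.0

# Crude improvement loop: nudge the policy toward whatever earns reward.
# This stands in for a policy-gradient step on the model's weights.
bias = 0.05
for step in range(200):
    rollouts = [sample_answer(bias) for _ in range(8)]
    rewards = [verifiable_reward(a) for a in rollouts]
    mean_reward = sum(rewards) / len(rewards)
    bias = min(1.0, bias + 0.01 * mean_reward)  # reinforce success

print(f"final success bias: {bias:.2f}")
```

The point of the sketch is only the contrast: the reward signal is verifiable by a checker, so the agent can explore the outcome space at scale, whereas RLHF's reward model is fit to finite, subjective human labels.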
manscrober | 6 days ago
a) 2022 is not that long ago, and b) it was an important first step toward usable AI, but not a scalable one. I'd say "RL training" is not the same as RLHF.
bigyabai | 6 days ago
The original ChatGPT was like 3 years after the first usable transformer models. |