RL is barely even a training method; it's more of a dataset generation method.

I feel like both this comment and the parent comment highlight how RL has been going through a cycle of misunderstanding recently, brought on by another of its popularity booms now that it is being used to train LLMs.

	mistercheph 3 hours ago
		care to correct the misunderstanding?