Remix clone Hacker News


	chongliqin 3 days ago
		TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad), and are typically not used for LLM training. Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above.
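
		For anyone unfamiliar with the deadly triad (function approximation, bootstrapping, and off-policy updates combined), here is a minimal sketch of the well-known two-state textbook illustration. It is not taken from the work discussed in this thread; the function name, default step size, and discount are assumptions chosen for illustration. Two states share one weight `w`, with values `w` and `2w`, and only the transition from the first state to the second (reward 0) is ever updated.

```python
def semi_gradient_td_weights(w0=1.0, alpha=0.1, gamma=0.99, steps=50):
    """Apply off-policy semi-gradient TD(0) to a single transition.

    Hypothetical illustration: v(s1) = 1 * w, v(s2) = 2 * w, and the only
    transition that is ever sampled is s1 -> s2 with reward 0.
    """
    w = w0
    history = [w]
    for _ in range(steps):
        # TD error: r + gamma * v(s2) - v(s1), with r = 0
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # Semi-gradient step: the gradient of v(s1) w.r.t. w is the feature 1.0
        w += alpha * td_error * 1.0
        history.append(w)
    return history


if __name__ == "__main__":
    weights = semi_gradient_td_weights()
    print("first:", weights[0], "last:", weights[-1])
```

		Each step multiplies `w` by `1 + alpha * (2 * gamma - 1)`, so the weight grows without bound whenever `gamma > 0.5`. A REINFORCE-style update uses sampled returns instead of a bootstrapped target, so it has no such feedback loop through the value estimate.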