chongliqin 3 days ago

TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad), and are typically not used for LLM training.

Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above.
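
To make the distinction concrete, here is a rough sketch (not code from the work being discussed) of the two kinds of learning signal. The function names, the toy reward sequence, and the value estimates are all made up for illustration. A REINFORCE-style update weights each step by the Monte Carlo return computed from observed rewards only, while a TD(0) target bootstraps from a learned value estimate, and bootstrapping is one leg of the deadly triad.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Discounted return-to-go G_t for each step, from observed rewards only.

    This is the quantity a plain REINFORCE update multiplies with
    grad log pi(a_t | s_t); no learned value function is involved.
    """
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td0_targets(rewards, values, gamma=1.0):
    """One-step TD targets r_t + gamma * V(s_{t+1}).

    values[t] is a (hypothetical) current estimate of V(s_t); the state
    after the final step is treated as terminal with value 0.
    """
    targets = []
    for t, r in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(r + gamma * next_value)
    return targets


# Toy sparse-reward episode: reward only on the last step (assumed data).
rewards = [0.0, 0.0, 0.0, 1.0]
# Arbitrary, possibly inaccurate value estimates (assumed data).
values = [0.5, 0.2, 0.9, 0.4]

mc = monte_carlo_returns(rewards)
td = td0_targets(rewards, values)
print("Monte Carlo returns:", mc)
print("TD(0) targets:      ", td)
```

The Monte Carlo returns depend only on what actually happened in the episode, whereas the TD targets move with whatever the value estimates currently say.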