▲ | chongliqin 3 days ago | |
TD-based approaches can have an advantage in sparse-reward settings, but they come with a heap of other problems, especially in the off-policy setting (see the deadly triad: function approximation, bootstrapping, and off-policy training), and are typically not used for LLM training. Here we make a connection to REINFORCE-style policy gradients, which would not show any of the behavior you mentioned above. |
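To make the distinction concrete, here is a rough sketch (not the code from the work under discussion) of the two kinds of learning signal. The function names and the toy numbers are assumptions made up for illustration. The point is only that the REINFORCE-style signal is built from sampled returns, while the TD target bootstraps from a learned value estimate.

```python
import numpy as np


def monte_carlo_returns(rewards, gamma=1.0):
    """Return-to-go G_t, computed only from the sampled rewards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_targets(rewards, values, gamma=1.0):
    """One-step TD targets r_t + gamma * V(s_{t+1}).

    These bootstrap from the value estimates, so errors in `values`
    feed back into the targets.
    """
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)  # assume the episode terminates
    return np.asarray(rewards, dtype=float) + gamma * next_values


def reinforce_surrogate_loss(log_probs, returns):
    """Surrogate whose gradient w.r.t. the policy parameters is the
    REINFORCE estimator: -sum_t log pi(a_t | s_t) * G_t."""
    return -np.sum(np.asarray(log_probs) * np.asarray(returns))


# Hypothetical sparse-reward episode: reward only at the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.3, 0.2, 0.1]  # made-up value estimates, used by the TD targets only
mc = monte_carlo_returns(rewards, gamma=0.5)
td = td_targets(rewards, values, gamma=0.5)
```

In this toy episode the Monte Carlo returns depend only on the observed rewards, whereas the TD targets change whenever the value estimates change.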