iandanforth 4 days ago
How is this kind of analogy helpful? You can frame any optimization problem as RL if you try hard enough. RL is a method of optimization which calls the optimum "reward maximization". You can craft the reward function any which way you want. The key point about RL is that it is a sequential decision-making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, then why bother calling it RL?
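To make that concrete, here's a toy sketch (my own example, not from the paper; the function, learning rates, and iteration counts are arbitrary): minimize a 1-D function f directly with gradient descent, then reframe the exact same problem as a one-step "RL" task where a Gaussian policy picks a single action and gets reward -f(action). REINFORCE recovers the same answer, just noisier, because there is no sequence of decisions and no environment state for the RL framing to exploit.

    # Toy example: any optimization problem can be dressed up as one-step "RL".
    import random

    def f(x):                        # objective to minimize (arbitrary choice)
        return (x - 3.0) ** 2

    # Plain optimization: gradient descent on f.
    x = 0.0
    for _ in range(200):
        x -= 0.1 * 2.0 * (x - 3.0)   # x <- x - lr * f'(x)

    # Same problem framed as "RL": a Gaussian policy N(mu, 1) picks one action,
    # the "episode" ends immediately, and reward = -f(action).
    mu = 0.0
    for _ in range(5000):
        a = random.gauss(mu, 1.0)            # one decision, no environment dynamics
        reward = -f(a)
        grad_log_pi = a - mu                 # d/d_mu of log N(a | mu, 1)
        mu += 0.01 * reward * grad_log_pi    # REINFORCE update (noisy)

    print(f"gradient descent: {x:.3f}   one-step 'RL': {mu:.3f}")

Both land near x = 3; calling the second one RL adds nothing, because no agent is making multiple decisions over time.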
imtringued 4 days ago
I personally am quite disappointed by the abstract: "Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting." Uh, no? SFT is maximizing the RL objective in a dense reward setting. The entire point of RL, specifically actor-critic and Q-learning, is that the method turns a sparse reward into a continuous, dense signal that a model can be trained on with ordinary gradient descent. Just look at the definition of Q-learning and the Bellman equation it uses: the current action is chosen to maximize the predicted return (the Q-value), not the actual reward, which doesn't have to be continuous or differentiable. You can build an RL-based maze solver where only the goal state gives any reward and it would still work, albeit training extremely slowly. Meanwhile, supervised fine-tuning always produces a gradient on every single token.
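For what it's worth, here's a rough sketch of that maze example (my own toy code, not from the paper; grid size, reward placement, and hyperparameters are arbitrary choices): tabular Q-learning on a 5x5 grid where only the transition into the goal cell pays reward 1. The Bellman backup Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) gradually propagates that single sparse reward backwards until every state carries a useful value, and it does indeed take many episodes.

    # Toy sketch: tabular Q-learning on a 5x5 grid, reward ONLY at the goal.
    import random

    SIZE, GOAL = 5, (4, 4)
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

    Q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in range(4)}

    def step(state, a):
        dr, dc = ACTIONS[a]
        nxt = (min(max(state[0] + dr, 0), SIZE - 1),
               min(max(state[1] + dc, 0), SIZE - 1))
        reward = 1.0 if nxt == GOAL else 0.0       # sparse: reward only at the goal
        return nxt, reward, nxt == GOAL

    def greedy(state):                             # greedy action, random tie-breaking
        qs = [Q[(state, a)] for a in range(4)]
        best = max(qs)
        return random.choice([a for a in range(4) if qs[a] == best])

    for _ in range(2000):
        s, done = (0, 0), False
        while not done:
            a = random.randrange(4) if random.random() < EPS else greedy(s)
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * max(Q[(s2, a2)] for a2 in range(4))
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # Bellman backup: dense signal
            s = s2

    # The start state is 8 steps from the goal, so its value converges to
    # roughly GAMMA ** 7, even though it was never directly rewarded.
    print(max(Q[((0, 0), a)] for a in range(4)))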