porridgeraisin 2 days ago
Nevertheless it is next-token prediction. Each token the model predicts is an action in the RL setup, and context += that_token is the new state. The reward signal comes either from human labels (RLHF) or from math/code problems with a deterministic answer, where prediction == solution is used as the reward. Viewed through the right lens, policy-gradient approaches in RL are just supervised learning. You can search for Karpathy's more fleshed-out argument for the same; I'm on mobile now.
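A minimal sketch of that equivalence, assuming PyTorch and a plain REINFORCE loss (model, reward_fn, and the sampling loop are illustrative placeholders, not any particular library's training code):

    import torch

    def reinforce_step(model, prompt_ids, optimizer, reward_fn, max_new_tokens=64):
        # Sample a completion token by token: each sampled token is an "action",
        # and the growing context is the new "state".
        ids = prompt_ids.clone()
        log_probs = []
        for _ in range(max_new_tokens):
            logits = model(ids)[:, -1, :]              # next-token distribution
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()
            log_probs.append(dist.log_prob(tok))
            ids = torch.cat([ids, tok.unsqueeze(-1)], dim=-1)

        # Reward: 1.0 if the completion matches the known math/code answer,
        # or a human-preference score in the RLHF case.
        reward = reward_fn(ids)

        # REINFORCE: the loss is the ordinary next-token log-likelihood of the
        # sampled tokens, weighted by the reward. With the reward fixed at 1
        # this is exactly supervised fine-tuning on the sampled trajectory.
        loss = -(reward * torch.stack(log_probs).sum())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

(This ignores baselines, KL terms, and per-token credit assignment, but it is the core of the "RL is reward-weighted next-token prediction" view.)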
SEGyges 2 days ago
My short explanation would be that even for RL you are training on a next-token objective; but the next token is something that has been selected very carefully for solving the problem, and was generated by the model itself. So you're amplifying existing trajectories in the model by feeding the model's own outputs back to it, but only when those outputs solve a problem. This elides the KL penalty and the somewhat odd group scoring, which are the same in the limit but vastly more efficient in practice.
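For concreteness, a hedged sketch of that "amplify your own successful trajectories" view with group-relative scoring and a KL penalty (GRPO-flavoured; the exact losses and clipping in real implementations differ, and every name here is illustrative):

    import torch

    def group_relative_loss(policy_logprobs, ref_logprobs, rewards, kl_coef=0.1):
        # policy_logprobs: [G, T] per-token log-probs of G sampled completions
        # ref_logprobs:    [G, T] the same tokens scored by a frozen reference model
        # rewards:         [G]    e.g. 1.0 if the completion solved the problem, else 0.0

        # Group scoring: centre rewards within the group, so only completions
        # that did relatively better than their siblings get amplified.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # Next-token objective on the model's own outputs, weighted by advantage:
        # successful trajectories are reinforced, unsuccessful ones suppressed.
        pg_loss = -(adv.unsqueeze(-1) * policy_logprobs).mean()

        # Crude KL estimate keeping the policy near the reference model.
        kl = (policy_logprobs - ref_logprobs).mean()
        return pg_loss + kl_coef * kl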
astrange 2 days ago
Inference often isn't next-token prediction though, either weakly (because of speculative decoding/multiple-token outputs) or strongly (because of tool usage like web search).
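For the "weakly" case, a simplified sketch of speculative decoding (real implementations accept or reject drafted tokens probabilistically rather than by argmax agreement; names are illustrative):

    import torch

    def speculative_step(target, draft, ids, k=4):
        # A small draft model proposes k tokens greedily, one at a time.
        proposal = ids
        for _ in range(k):
            nxt = draft(proposal)[:, -1, :].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=-1)

        # The target model scores all k drafted positions in a single forward
        # pass, so more than one token can be committed per target-model call.
        tgt_pred = target(proposal)[:, -(k + 1):-1, :].argmax(-1)
        drafted = proposal[:, -k:]

        # Keep the longest prefix of drafted tokens the target agrees with.
        agree = int((tgt_pred == drafted).long().cumprod(dim=-1).sum())
        return proposal[:, : ids.shape[1] + agree]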