porridgeraisin 2 days ago
Nevertheless it is next-token prediction. Each token the model predicts is an action in the RL setup, and context += that_token is the new state. The reward signal comes either from human labels (RLHF) or from math/code problems with a deterministic answer, where prediction == solution is used as the reward. Viewed through the right lens, policy-gradient approaches in RL are just supervised learning. You can search for Karpathy's more fleshed-out argument for the same; I'm on mobile now.
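A minimal sketch of that equivalence, assuming PyTorch and a plain REINFORCE loss (model, reward_fn, and the sampling loop are illustrative placeholders, not any particular library's training code):

    import torch

    def reinforce_step(model, prompt_ids, optimizer, reward_fn, max_new_tokens=64):
        # Sample a completion token by token: each sampled token is an "action",
        # and the growing context is the new "state".
        ids = prompt_ids.clone()
        log_probs = []
        for _ in range(max_new_tokens):
            logits = model(ids)[:, -1, :]              # next-token distribution
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()
            log_probs.append(dist.log_prob(tok))
            ids = torch.cat([ids, tok.unsqueeze(-1)], dim=-1)

        # Reward: 1.0 if the completion matches the known math/code answer,
        # or a human-preference score in the RLHF case.
        reward = reward_fn(ids)

        # REINFORCE: the loss is the ordinary next-token log-likelihood of the
        # sampled tokens, weighted by the reward. With the reward fixed at 1
        # this is exactly supervised fine-tuning on the sampled trajectory.
        loss = -(reward * torch.stack(log_probs).sum())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

(This ignores baselines, KL terms, and per-token credit assignment, but it is the core of the "RL is reward-weighted next-token prediction" view.)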
SEGyges 2 days ago
My short explanation would be that even for RL you are training on a next-token objective; but the next token is something that has been selected very carefully for solving the problem, and was generated by the model itself. So you're amplifying existing trajectories in the model by feeding the model's own outputs back to it, but only when those outputs solve a problem. This elides the KL penalty and the somewhat odd group scoring, which are the same in the limit but vastly more efficient in practice.
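For concreteness, a hedged sketch of that "amplify your own successful trajectories" view with group-relative scoring and a KL penalty (GRPO-flavoured; the exact losses and clipping in real implementations differ, and every name here is illustrative):

    import torch

    def group_relative_loss(policy_logprobs, ref_logprobs, rewards, kl_coef=0.1):
        # policy_logprobs: [G, T] per-token log-probs of G sampled completions
        # ref_logprobs:    [G, T] the same tokens scored by a frozen reference model
        # rewards:         [G]    e.g. 1.0 if the completion solved the problem, else 0.0

        # Group scoring: centre rewards within the group, so only completions
        # that did relatively better than their siblings get amplified.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # Next-token objective on the model's own outputs, weighted by advantage:
        # successful trajectories are reinforced, unsuccessful ones suppressed.
        pg_loss = -(adv.unsqueeze(-1) * policy_logprobs).mean()

        # Crude KL estimate keeping the policy near the reference model.
        kl = (policy_logprobs - ref_logprobs).mean()
        return pg_loss + kl_coef * kl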
astrange 2 days ago
Inference often isn't next-token prediction though, either weakly (because of speculative decoding/multiple-token outputs) or strongly (because of tool usage like web search).
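For the "weakly" case, a simplified sketch of speculative decoding (real implementations accept or reject drafted tokens probabilistically rather than by argmax agreement; names are illustrative):

    import torch

    def speculative_step(target, draft, ids, k=4):
        # A small draft model proposes k tokens greedily, one at a time.
        proposal = ids
        for _ in range(k):
            nxt = draft(proposal)[:, -1, :].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=-1)

        # The target model scores all k drafted positions in a single forward
        # pass, so more than one token can be committed per target-model call.
        tgt_pred = target(proposal)[:, -(k + 1):-1, :].argmax(-1)
        drafted = proposal[:, -k:]

        # Keep the longest prefix of drafted tokens the target agrees with.
        agree = int((tgt_pred == drafted).long().cumprod(dim=-1).sum())
        return proposal[:, : ids.shape[1] + agree]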