markisus 2 days ago
I'm not sure that LLMs are solely autocomplete. Next-token prediction is only the pretraining objective; after that, I thought you apply reinforcement learning.
porridgeraisin 2 days ago
Nevertheless it is next-token prediction. Each token the model emits is an action in the RL setup, and context += that_token is the new state. Solutions are either human-labelled (RLHF) or math/code problems with deterministic answers, and prediction == solution is used as the reward signal. Policy-gradient approaches in RL are just supervised learning, when viewed through the right lens (rough sketch below). You can search for Karpathy's more fleshed-out argument for the same; I'm on mobile now.
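A minimal sketch of that equivalence, assuming PyTorch. The toy linear "model", the sampled tokens, and the binary reward are made-up stand-ins for illustration, not anyone's actual training code:

    # REINFORCE on token generation reduces to reward-weighted
    # cross-entropy on the sampled tokens -- i.e. supervised learning.
    import torch

    vocab_size, dim = 100, 32
    model = torch.nn.Linear(dim, vocab_size)  # toy stand-in for an LLM head

    def generate(hidden_states):
        """Each token is an 'action'; context += token is the new state."""
        logits = model(hidden_states)                    # (T, vocab)
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                           # sampled actions
        return tokens, dist.log_prob(tokens)             # log-prob per action

    hidden = torch.randn(10, dim)    # placeholder for contextual states
    tokens, logp = generate(hidden)

    # Reward: 1 if the prediction matches the known solution, else 0
    # (hypothetical label, as in a math/code problem with a checkable answer)
    solution = torch.randint(0, vocab_size, (10,))
    reward = float((tokens == solution).all())

    # Policy-gradient loss: -reward * sum(log p(action)). With reward in
    # {0, 1}, this is exactly the cross-entropy loss on the sampled tokens,
    # applied only when they match the solution.
    loss = -reward * logp.sum()
    loss.backward()

The point of the sketch: when the reward is binary, the policy gradient is the supervised next-token-prediction gradient, gated by whether the sample was correct.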