▲ | toxik 3 days ago | |
Why do you think that? InstructGPT was predominantly trained as a next-token predictor on whatever soup of data OpenAI curated at the time. The alignment signal (both RL part and the supervised prompt/answer pairs) are a tiny bit of the gradient. |