didroe 5 days ago

>With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.

How does reinforcement learning force the weights to be logically consistent? Isn't it just training against a coarser, fuzzier fitness signal?

More generally, is the model really solving the task if it's given a large number of attempts and an oracle that says whether each one is correct? Humans can answer the questions in one shot and self-check the answer, whereas this is like trial and error with an external expert who tells you to try again.