RL is more information inefficient than you thought(dwarkesh.com)
41 points by cubefox 3 days ago | 8 comments
andyjohnson0 2 hours ago | parent | next [-]

Since it is not explicitly stated, "RL" in this article means Reinforcement Learning.

https://en.wikipedia.org/wiki/Reinforcement_learning

quote 16 minutes ago | parent [-]

I, too, started parsing this as RL=real life and that’s why I found the headline interesting

scaredginger an hour ago | parent | prev | next [-]

Bit of a nitpick, but I think his terminology is wrong. Like RL, pretraining is also a form of *un*supervised learning.

cubefox 36 minutes ago | parent [-]

Usual terminology for the three main learning paradigms:

- Supervised learning (e.g. matching labels to pictures)

- Unsupervised / self-supervised learning (pretraining)

- Reinforcement learning

Now the confusing thing is that Dwarkesh Patel instead calls pretraining "supervised learning" and you call reinforcement learning a form of unsupervised learning.

macleginn an hour ago | parent | prev [-]

In the limit, in the "happy" case (positive reward), policy gradients boil down to performing more or less the same update as the usual supervised strategy for each generated token (or some subset of them, if we use sampling).

In the unhappy case, they penalise the model for selecting particular tokens in particular circumstances. This is not something you can normally do with supervised learning, but it is unclear how helpful it is: if a bad and a good answer share a prefix, that prefix will be reinforced in one case and penalised in the other (not in exactly the same way, but still).

So during on-policy learning we desperately need the model to stumble on correct answers often enough, and that can only happen if the model already knows roughly how to solve the problem, otherwise the search space is too big. In other words, while in supervised learning we moved away from hand-crafting inductive biases and trusted models to figure everything out by themselves, in RL this does not really seem possible.
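A minimal numpy sketch of the equivalence in the happy case (toy 4-token vocabulary; all numbers and names are illustrative): with reward +1, the REINFORCE gradient for a sampled token is exactly the supervised cross-entropy gradient that treats that token as the label, and a negative reward just flips the sign.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One decoding step over a made-up 4-token vocabulary.
logits = np.array([1.0, 0.5, -0.2, 0.1])
probs = softmax(logits)
token = 2                      # the token the model happened to sample
onehot = np.eye(4)[token]

# Gradient of log pi(token) w.r.t. the logits is (onehot - probs).
grad_logpi = onehot - probs

# REINFORCE update with reward +1: identical to the supervised
# cross-entropy gradient with `token` as the "label".
pg_update = 1.0 * grad_logpi
ce_grad = onehot - probs
assert np.allclose(pg_update, ce_grad)

# Reward -1 flips the sign: the sampled token is pushed *down*,
# which plain supervised learning has no direct analogue for.
penalized = -1.0 * grad_logpi
```

The sign flip is the whole difference: the "unhappy" update is just the supervised one negated, applied to every token of the sampled answer, shared prefix included.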

sgsjchs 36 minutes ago | parent [-]

The trick is to provide dense rewards, i.e. reward not only once the full goal is reached, but a little bit for every random flailing of the agent in the approximately correct direction.
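A toy sketch of the contrast (the line-world environment, function names, and the 0.1 shaping scale are all made up for illustration): the agent walks toward position 10; the sparse scheme pays only at the goal, while the dense, potential-based scheme pays a little for every step that reduces the remaining distance.

```python
GOAL = 10

def sparse_reward(pos):
    # Signal only once the full goal is reached.
    return 1.0 if pos == GOAL else 0.0

def dense_reward(prev_pos, pos):
    # Potential-based shaping: small credit for moving in roughly
    # the right direction, small penalty for moving away.
    return (abs(GOAL - prev_pos) - abs(GOAL - pos)) * 0.1

# A short, flailing walk that never reaches the goal (made up).
trajectory = [0, 1, 2, 1, 2, 3]
sparse_total = sum(sparse_reward(p) for p in trajectory[1:])
dense_total = sum(dense_reward(a, b)
                  for a, b in zip(trajectory, trajectory[1:]))
# sparse_total == 0.0 (no learning signal at all),
# dense_total  ≈ 0.3 (signal on every single step).
```

Under the sparse scheme this trajectory is invisible to the learner; under the dense scheme every step carries gradient, which is exactly what makes random exploration viable when the model can't yet solve the task outright.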
