andy12_ 4 hours ago

> Even with continuous backpropagation and "learning"

That's what I said. Backpropagation cannot be enough; that's not how neurons work in the slightest. When you put biological neurons in a Pong environment, they learn to play not through some kind of loss or reward function but by self-organizing to avoid unpredictable stimulation. As far as I know, no architecture learns in such an unsupervised way.

https://www.sciencedirect.com/science/article/pii/S089662732...

torginus 2 hours ago | parent [-]

Forgive me for being ignorant, but 'loss' in a supervised-learning ML context encodes how unlikely (high loss) or likely (low loss) the network's prediction of the output was, given the input.

This sounds very similar to what neurons do (avoid unpredictable stimulation).

andy12_ an hour ago | parent [-]

So, I have been thinking about this for a little while. Imagine a model f that takes a world state x and makes a prediction y. At a high level, a traditional supervised model is trained like this:

f(x) = y' => loss(y', y) => how good was my prediction? Train f through backprop with that error.
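The supervised loop above can be sketched in a few lines. This is a toy, not any specific framework: the model f is a single weight, and backprop collapses to a hand-derived gradient. All names (w, lr, data) are illustrative.

```python
def f(w, x):
    return w * x  # toy "network": one learnable weight

def loss(y_pred, y):
    return (y_pred - y) ** 2  # squared error: low when the prediction is close

# Tiny dataset where the true mapping is y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]

w, lr = 0.0, 0.05
for _ in range(100):
    for x, y in data:
        y_pred = f(w, x)
        grad = 2 * (y_pred - y) * x  # d loss / d w
        w -= lr * grad               # "train f through backprop with that error"
```

After training, w has converged close to the true slope 3.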

While a model trained with reinforcement learning is more like this, where m(y) is the resulting world state of taking an action y the model predicted:

f(x) = y' => m(y') = z => reward(z) => how good was the state I ended up in because of my actions? Train f with an algorithm like REINFORCE using that reward, as the world m is a non-differentiable black box.
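A minimal sketch of that RL loop, under toy assumptions: the world m and the reward are made up, the policy f is a single Bernoulli over two actions, and the update is the plain REINFORCE score-function gradient. None of this is from the thread; it just shows why no gradient ever flows through m.

```python
import math, random

random.seed(0)

def m(action):
    # Black-box world: we only observe the resulting state, never its gradient.
    return 1.0 if action == 1 else -1.0

def reward(z):
    # "How good was the state I ended up in"
    return z

theta = 0.0  # logit of picking action 1

def policy_prob(theta):
    return 1.0 / (1.0 + math.exp(-theta))

lr = 0.5
for _ in range(200):
    p = policy_prob(theta)
    action = 1 if random.random() < p else 0
    r = reward(m(action))
    # REINFORCE: step along grad log pi(action) * reward.
    # For a Bernoulli policy, d/dtheta log pi(action) = (action - p).
    theta += lr * (action - p) * r
```

The policy ends up strongly preferring action 1, the one whose resulting state is rewarded, even though m itself was never differentiated.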

A group of neurons, though, is more like predicting what world state results from taking my action, g(x, y), and learning by tuning both g and the action taken, f(x):

f(x) = y' => m(y') = z => g(x, y') = z' => loss(z, z') => how predictable were the results of my actions? Train g normally with backprop, and train f with an algorithm like REINFORCE using negative surprise as the reward.
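That neuron-like loop can be sketched end to end. Everything here is an illustrative assumption (the toy world m, a tabular world model g, the learning rates): action 0 leads to a predictable state, action 1 to a noisy one, g is trained by gradient descent on the prediction error, and the policy f is trained with REINFORCE on negative surprise.

```python
import math, random

random.seed(0)

def m(action):
    # Action 0: perfectly predictable state. Action 1: noisy, unpredictable one.
    return 0.0 if action == 0 else random.uniform(-2.0, 2.0)

g = [0.0, 0.0]   # world model g(x, y'): predicted next state per action
theta = 0.0      # policy logit for action 1 (the unpredictable action)

def p1(theta):
    return 1.0 / (1.0 + math.exp(-theta))

lr_g, lr_f = 0.2, 0.3
for _ in range(500):
    p = p1(theta)
    action = 1 if random.random() < p else 0
    z = m(action)                  # actual next world state
    z_pred = g[action]             # prediction g(x, y')
    surprise = (z - z_pred) ** 2   # loss(z, z')
    g[action] -= lr_g * 2 * (z_pred - z)    # train g by gradient descent
    theta += lr_f * (action - p) * (-surprise)  # REINFORCE, reward = -surprise
```

The world model for action 0 becomes exact, its surprise goes to zero, and the policy drifts away from the noisy action, i.e. it "self-organizes to avoid unpredictable stimulation".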

After talking with GPT5.2 for a little while, it seems like Curiosity-driven Exploration by Self-supervised Prediction [1] might be an architecture similar to the one I described for neurons? But with the twist that f is rewarded for making the prediction error bigger (not smaller!) as a proxy for "curiosity".

[1] https://arxiv.org/pdf/1705.05363