mystraline 3 days ago

> BP's modern version (also called the reverse mode of automatic differentiation)

So... Automatic integration?

Proportional, integral, derivative. A PID loop sure sounds like what they're talking about.

eigenspace 3 days ago | parent | next [-]

Reverse mode automatic differentiation is not integration. It's still differentiation, just a different method of calculating the derivative than the one you'd think to do by hand. It basically applies the chain rule in the opposite order from what is intuitive to most people.

It has a lot more overhead than regular forwards mode autodiff because you need to cache values from running the function and refer back to them in reverse order. The advantage is that for functions with many inputs and very few outputs (the classic example is calculating the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.

On the other hand, traditional forwards mode derivatives are most efficient for functions with very few inputs, but many outputs. It's essentially a duality relationship.
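
A minimal sketch of that asymmetry, assuming JAX is available (the function and sizes here are made up purely for illustration):

```python
import jax
import jax.numpy as jnp

def f(x):
    # Scalar-valued function of a high-dimensional input: many inputs, one output.
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.ones(10_000)

# Reverse mode: one backward pass over the cached primal values
# yields the entire gradient (this is what backprop does).
grad_rev = jax.jacrev(f)(x)          # shape (10_000,)

# Forward mode: a single pass only gives one directional derivative
# (a Jacobian-vector product); the full gradient would need 10_000 passes.
direction = jnp.zeros_like(x).at[0].set(1.0)
_, dfdx0 = jax.jvp(f, (x,), (direction,))
```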

stephencanon 3 days ago | parent [-]

I don't think most people think to do either direction by hand; it's all just matrix multiplication, and you can multiply in whatever order makes it easier.

eigenspace 3 days ago | parent [-]

I'm just talking about the general algorithm for writing down the derivative of `f(g(h(x)))` using the chain rule.

For vector valued functions, the naive way you would learn in a vector calculus class corresponds to forward mode AD.
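
For a concrete picture of the "same product, different association" point, here is a toy sketch (shapes chosen only for illustration) of the chain rule for `f(g(h(x)))` as a product of Jacobian matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, p = 1000, 50, 50, 1      # x in R^n; h: R^n -> R^k; g: R^k -> R^m; f: R^m -> R^p

Jh = rng.standard_normal((k, n))  # Jacobian of h at x
Jg = rng.standard_normal((m, k))  # Jacobian of g at h(x)
Jf = rng.standard_normal((p, m))  # Jacobian of f at g(h(x))

# Forward mode: accumulate from the input side first.
J_fwd = Jf @ (Jg @ Jh)            # intermediate is m x n (large when n is large)

# Reverse mode: accumulate from the output side first.
J_rev = (Jf @ Jg) @ Jh            # intermediate is p x k (small when p is small)

assert np.allclose(J_fwd, J_rev)  # same chain rule, different association
```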

imtringued 3 days ago | parent | prev | next [-]

Forward mode automatic differentiation creates a formula for each scalar derivative. If you have a billion parameters, you have to calculate each derivative from scratch.

As the name implies, the calculation is done forward.

Reverse mode automatic differentiation starts from the root of the symbolic expression and calculates the derivative for each subexpression simultaneously.

The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
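
The analogy in code (plain Python, nothing autodiff-specific):

```python
def fib_naive(n):
    # Exponential time: fib_naive(n - 2) gets recomputed inside fib_naive(n - 1).
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_iterative(n):
    # Linear time: each intermediate value is computed once and reused,
    # much like reverse mode reuses its cached intermediate results.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```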

digikata 3 days ago | parent | prev | next [-]

There are large bodies of work on optimization in state space control theory that I strongly suspect have a lot of crossover with AI, or at least a very similar mathematical structure.

e.g. optimizing state space control coefficients looks something like training an LLM weight matrix...

brosco 3 days ago | parent [-]

There is indeed a lot of crossover, and a lot of neural networks can be written in a state space form. The optimal control problem should be equivalent to training the weights, as you mention.
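
As a rough sketch of that structural parallel (names and shapes here are purely illustrative, not from any particular framework), a linear state-space update sits right next to a vanilla RNN cell:

```python
import numpy as np

def state_space_step(x, u, A, B):
    # Discrete-time linear state-space update: x_{k+1} = A x_k + B u_k
    return A @ x + B @ u

def rnn_step(x, u, W, U):
    # Vanilla RNN cell: the same structure with learned weights and a nonlinearity.
    return np.tanh(W @ x + U @ u)
```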

However, from what I have seen, this isn't really a useful way of reframing the problem. The optimal control problem is at least as hard as, if not harder than, the original problem of training the neural network, and the latter has mature, performant software for doing it efficiently. That's not to say there isn't good software for optimal control, but it's a more general problem, so off-the-shelf solvers can't leverage the network structure very well.

Some researchers have made interesting theoretical connections like in neural ODEs, but even there the practicality is limited.

blt 2 days ago | parent [-]

Yes, in most cases the reduction of supervised learning to optimal control is not interesting.

We can also reduce supervised learning to reinforcement learning, but that doesn't mean we should use RL algorithms to do supervised learning.

We can also reduce sorting a list of integers to SAT, but that doesn't mean we should use a SAT solver to sort lists of integers.
