▲ | weinzierl 6 days ago
"Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider pure functions, i.e. functions whose output depends solely on its parameters (no state)." I think I've seen this notion that the constraint is pureness also in documentation of autodiff libraries, but this cannot be strong enough, right? It easy enough to come up with functions that are nowhere differentiable. So my question is, what are the actual requirements a state of the art autodiff library has for the input function and why do people focus on the pureness aspect if that is probably the least of the problems. | ||
▲ | yorwba 6 days ago
If the function is pure, autodiff can produce a result. If the function is not differentiable at a given input, the result produced by autodiff of course cannot be the derivative (since none exists), but it can be another number (e.g. a subderivative) that is useful in some of the ways a derivative would have been useful.
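A minimal sketch of this: ReLU has no derivative at 0, but autodiff systems commonly hard-code a subderivative there (returning 0 at the kink is a typical choice, though any value in [0, 1] would be a valid subderivative). The names here are illustrative, not any particular library's API.

```python
def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    # The true derivative does not exist at x == 0; returning 0.0 there
    # picks one element of the subdifferential [0, 1], which is what many
    # autodiff systems do in practice.
    return 1.0 if x > 0 else 0.0

print(relu_grad(0.0))   # 0.0, a valid subderivative at the kink
print(relu_grad(2.0))   # 1.0, the actual derivative away from the kink
```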
▲ | grandempire 6 days ago
State is just values which are not considered variable function inputs; in other words, each time you change the state you get a new function. For example, take f(x, y) = xy and define a differentiable function g(x) = f(x, a). You can think of "a" as a state variable.
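This view can be sketched directly in code: the "state" a is captured when g is built, so every new value of a yields a new differentiable function of x alone. The finite-difference helper below is a crude stand-in for autodiff, just to check the derivative.

```python
def f(x, y):
    return x * y

def make_g(a):
    # a is "state": captured at construction time, not a variable input.
    # Each distinct a gives a distinct function g.
    return lambda x: f(x, a)

def deriv(fn, x, h=1e-6):
    # Central finite difference as a stand-in for autodiff.
    return (fn(x + h) - fn(x - h)) / (2 * h)

g = make_g(3.0)
print(deriv(g, 2.0))  # ≈ 3.0, since g(x) = 3x has derivative 3 everywhere
```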
▲ | hansvm 5 days ago
Purity, interestingly, is not a requirement. You can represent any closure as something implicitly captured, and you can represent mutable state as an immutable input paired with a related immutable output. For any autodiff library that doesn't also do dense+sparse decompositions (none of the major ones do any such optimization; the programmer has to manually choose which things to represent how), that doesn't even cost more space for reverse-mode autodiff than your other options: you need to maintain the before and after state anyway, give or take your particular checkpointing scheme, and if you don't have a way to represent the delta cheaply then the differentiation was going to be expensive regardless, compared with the underlying mutable forward pass, which was probably faster and cheaper than the alternatives. Purity is often easier to code around, especially in "complicated" languages with many features you'd like to support, but it's not mandatory.

In terms of actual requirements, a condition that's sufficient [0] is for every sub-component to be differentiable and for no dynamic control flow to depend on the things being differentiated. In practice, most libraries wind up requiring something like this, mostly because it's very hard to do anything else.

As an example, define f(x) := 0 for floats with an even LSB and 1 for floats with an odd LSB, and define g(x) := 1 - f(x). Neither of these is meaningfully differentiable, but g(x) + f(x) is identically equal to one. Autodiff relies crucially on the fact that it can perform local transformations, and the whole-program analysis needed to notice that kind of cancellation is (a) impossible in general, and (b) hard even when it's possible.
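The LSB example above is concrete enough to write down. The sketch below inspects the last bit of the IEEE-754 double representation via struct packing; f and g are pure but nowhere differentiable, yet their sum is the constant 1, which no local rule-by-rule transformation can discover.

```python
import struct

def lsb(x):
    # Last bit of the IEEE-754 double encoding of x.
    return struct.unpack("<Q", struct.pack("<d", x))[0] & 1

def f(x):
    # 0 for floats with an even LSB, 1 for floats with an odd LSB.
    return float(lsb(x))

def g(x):
    return 1.0 - f(x)

# Neither f nor g is meaningfully differentiable anywhere, yet their sum
# is identically the constant function 1 -- which is differentiable.
# Only whole-program analysis, not local autodiff rules, could see this.
print(f(0.1) + g(0.1))  # 1.0
```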
For local-only autodiff (the only thing people ever do), the thing that's necessary is for every sub-component to have a derivative-like operator defined such that, whenever the sub-components are composed into a differentiable function, the normal chain rule and other autodiff compositions of those operators also yield a differentiable result representing the derivative in question (along with some requirements on dynamic control flow -- they don't have to be quite as strict as I described, but they can't be relaxed in general with local-only autodiff, so the dynamic-control-flow requirement from the paragraph above is also necessary). There are few (zero?) components where that's possible unconditionally -- an adversary can always come up with a composition that makes the derivative incorrect. However, for some interesting functions (like eigenvalues and eigenvectors), used the way people normally use them, these sorts of operators can be defined. E.g., the eigenvector derivative is not unique (it's only defined up to a choice of phase), but if your composition also doesn't depend on phase then you're still fine.

[0] Even for things like differentiating through a loop converging to a value, this property holds, with one meaningful exception: the error in the derivative, compared with the true function you're approximating, will still converge to zero with enough iterations, but that number can be much higher than what you need for the function itself to converge. You _will_, however, get the derivative of the approximation exactly.
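The footnote's point about differentiating through a convergent loop can be demonstrated with a hand-rolled forward-mode sketch (the Dual class below is illustrative, not any library's API). Pushing dual numbers through the Babylonian square-root iteration yields exactly the derivative of the approximation, which converges to the derivative of sqrt, namely 1/(2*sqrt(x)), as the iteration count grows.

```python
class Dual:
    """Minimal forward-mode autodiff value: val plus derivative dot."""
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __truediv__(self, other):
        # Quotient rule: (u/v)' = (u'v - uv') / v^2
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot)
                    / (other.val * other.val))
    def scale(self, c):
        # Multiply by a plain constant.
        return Dual(c * self.val, c * self.dot)

def approx_sqrt(x, iters):
    # Babylonian iteration y <- (y + x/y) / 2. Differentiating "through
    # the loop" gives the exact derivative of this approximation, which
    # converges to 1/(2*sqrt(x)) as iters grows.
    y = Dual(1.0, 0.0)
    for _ in range(iters):
        y = (y + x / y).scale(0.5)
    return y

x = Dual(4.0, 1.0)        # seed the derivative dx/dx = 1
y = approx_sqrt(x, 20)
print(y.val, y.dot)        # ≈ 2.0 and 0.25 (= 1/(2*sqrt(4)))
```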