whimsicalism 3 days ago

Despite the common refrain about how different symbolic differentiation and AD are, they are actually the same thing.

DoctorOetker 3 days ago | parent

Not at all.

There are two main forms of AD: forward mode (optimal when the function being differentiated has more outputs than latent parameter inputs) and reverse mode (optimal when it has more latent parameter inputs than outputs). If you don't understand why, you don't understand AD.

If you understood AD, you'd know why, and you'd also see a huge difference with symbolic differentiation. In symbolic differentiation the input is an expression or DAG, and the intermediate variables computed along the way are themselves symbolic expressions (typically computed in reverse order, as taught in high school or university, so the expression grows exponentially with each deeper nested function; only at the end are the input coordinates substituted into the final expression to obtain the gradient). In both forward and reverse mode AD, the variables computed along the way are numbers, not symbolic expressions.
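To make that last point concrete, here is a minimal forward-mode sketch using dual numbers (the Dual class and f below are illustrative toys, not any library's API): every intermediate quantity is a (value, derivative) pair of plain numbers, and one pass yields the derivative along one input direction.

    class Dual:
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot

        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)

        def __mul__(self, other):
            # product rule, applied to numbers rather than expressions
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)

    def f(x, y):
        return x * x * y + y           # f(x, y) = x^2 * y + y

    # df/dx at (3, 2): seed x with dot=1 and y with dot=0
    print(f(Dual(3.0, 1.0), Dual(2.0, 0.0)).dot)   # 12.0

Getting a full gradient of an N-input function this way takes N such passes (one seed direction per input), which is exactly why reverse mode wins in the many-inputs, few-outputs case.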

The third "option" is numeric differentiation, but for N latent parameter inputs this requires (N+1) forward evaluations: N of the function f(x1,x2,..., xi + delta, ..., xN) and 1 reference evaluation at f(x1, ..., xN). Picking a smaller delta makes it closer to a real gradient assuming infinite precision, but in practice there will be irregular rounding near the pseudo "infinitesimal" values of real world floats; alternatively take delta big enough, but then its no longer the theoretical gradient.

So symbolic differentiation was destined to fail due to ever-increasing symbolic expression length (driven by the chain rule).
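As a toy illustration of that swell (a made-up expression encoding, not a real computer algebra system): a naive tree-based differentiator copies both factors at every application of the product rule, so the derivative expression for repeated squaring grows exponentially with depth.

    def d(expr, var):
        if isinstance(expr, str):                   # a variable
            return 1.0 if expr == var else 0.0
        op, a, b = expr
        if op == '+':
            return ('+', d(a, var), d(b, var))
        if op == '*':                               # product rule copies a and b
            return ('+', ('*', d(a, var), b), ('*', a, d(b, var)))

    e = 'x'
    for _ in range(4):
        e = ('*', e, e)                             # ((x*x)*(x*x))*... repeated squaring
    print(len(str(e)), len(str(d(e, 'x'))))         # the derivative string is far larger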

Numeric differentiation was destined to fail due to imprecise gradient computation and the huge number of forward passes (N+1, many billions for current models) needed to get a single (!) gradient.

AD gives the theoretically correct result with a single forward and backward pass (as opposed to N+1 passes), without requiring billions of passes or lots of storage to hold strings of formulas.
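For concreteness, a minimal reverse-mode sketch (a toy computation graph, not any framework's API): one numeric forward pass records the operations, one backward pass accumulates the gradient of every input at once.

    class Var:
        def __init__(self, val, parents=()):
            self.val, self.parents, self.grad = val, parents, 0.0

        def __add__(self, other):
            return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.val * other.val, [(self, other.val), (other, self.val)])

    def backward(out):
        # topological order, so each node's gradient is complete
        # before it is pushed to that node's parents
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)
        visit(out)
        out.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

    x, y = Var(3.0), Var(2.0)
    z = x * x * y + y          # single forward pass (z = x^2*y + y)
    backward(z)                # single backward pass
    print(x.grad, y.grad)      # 12.0 10.0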

whimsicalism 3 days ago | parent

I simply do not agree that you are making a real distinction and I think comments like "If you don't understand why, you don't understand AD" are rude.

AD is just a straightforward application of the pushforwards/pullbacks from differential geometry, which are themselves just the chain rule. It is important to distinguish between a mathematical concept and a particular algorithm/computation for implementing it. The symbolic manipulation with an 'exponentially growing nested function' is one particular way of applying the chain rule, but it is not the only way.

The problem you describe with symbolic differentiation (exponential growth of expressions) is not inherent to symbolic differentiation itself, but to a particular naïve implementation. If you represent computations as DAGs and apply common subexpression elimination, the blow-up you mention can be avoided. In fact, forward- and reverse-mode AD can be viewed as particular algorithmic choices for evaluating the same derivative information that symbolic differentiation encodes. If you represent your function as a DAG and propagate pushforwards/pullbacks, you've already avoided expression swell; see the sketch below.

https://emilien.ca/Notes/Notes/notes/1904.02990v4.pdf
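As a toy illustration of the DAG-plus-CSE point (a made-up Node encoding, not a real computer algebra system): differentiate over a shared graph with memoisation and the derivative comes out as a shared graph too, so repeated squaring to depth 50 is handled instantly, where a naive tree-based differentiator (like the toy one earlier in the thread) would blow up.

    class Node:
        def __init__(self, op, args=(), name=None):
            self.op, self.args, self.name = op, args, name

    def diff(node, var, cache=None):
        cache = {} if cache is None else cache
        if id(node) in cache:                 # shared subexpressions differentiated once
            return cache[id(node)]
        if node.op == 'var':
            d = Node('const', name=1.0 if node.name == var else 0.0)
        elif node.op == 'const':
            d = Node('const', name=0.0)
        elif node.op == '+':
            d = Node('+', [diff(a, var, cache) for a in node.args])
        elif node.op == '*':
            a, b = node.args
            d = Node('+', [Node('*', [diff(a, var, cache), b]),
                           Node('*', [a, diff(b, var, cache)])])
        cache[id(node)] = d
        return d

    e = Node('var', name='x')
    for _ in range(50):
        e = Node('*', [e, e])     # repeated squaring: a ~50-node DAG for x^(2^50)
    diff(e, 'x')                  # derivative DAG built in roughly linear time and space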