fouronnes3 3 days ago

There's something that's always been deeply confusing to me about comparing the Jacobian and the Hessian because their nature is very different.

The Hessian shouldn't have been called a matrix.

The Jacobian describes all the first order derivatives of a vector valued function (of multiple inputs), while the Hessian is all the second order derivatives of a scalar valued function (of multiple inputs). Why doesn't the number of dimensions of the array increase by one as the differentiation order increases? It does! The object that fully describes the second order derivatives of a vector valued function of multiple inputs is actually a 3-dimensional tensor: one dimension for the original vector valued output, and one for each order of differentiation. Mathematicians are afraid of tensors of more than 2 dimensions for some reason and want everything to be a matrix.

In other words, given a function R^n -> R^m:

Order 0: Output value: 1d array of shape (m) (a vector)

Order 1: First order derivative: 2d array of shape (m, n) (Jacobian matrix)

Order 2: Second order derivative: 3d array of shape (m, n, n) (array of Hessian matrices)

It all makes sense!
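A quick way to see those shapes concretely (a minimal sketch using JAX's jacobian/hessian; the function f below is just an arbitrary example with n = 3, m = 2):

    import jax
    import jax.numpy as jnp

    # Arbitrary example f: R^3 -> R^2
    def f(x):
        return jnp.array([jnp.sum(x**2), x[0] * x[1] * x[2]])

    x = jnp.array([1.0, 2.0, 3.0])

    print(f(x).shape)                # (2,)       order 0: output vector
    print(jax.jacobian(f)(x).shape)  # (2, 3)     order 1: Jacobian matrix
    print(jax.hessian(f)(x).shape)   # (2, 3, 3)  order 2: stack of Hessians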

Talking about "Jacobian and Hessian" matrices as if they are both naturally matrices is highly misleading.

ndriscoll 3 days ago | parent | next [-]

At least in my undergrad multivariate real analysis class, I remember the professor arranging things to strongly suggest that the Hessian should be thought of as ∇⊗∇, and that this was the second term in a higher dimensional Taylor series, so that the third derivative term would be ∇⊗∇⊗∇, etc. Things like tensor products or even quotient spaces weren't assumed knowledge, so it wasn't explicitly covered, but I remember feeling the connection was obvious enough at the time. Then an introductory differential geometry class got into (n,m) tensors. So I'm quite sure mathematicians are fine dealing with tensors. My experience is that undergrad engineering math tries to avoid even covectors, though, so it will stay well clear of a coherent picture of multivariable calculus. E.g. my engineering professors would talk of the Dirac δ as an infinite spike/spooky doesn't-really-exist thing that makes integrals work or whatever. My analysis professor just said δ(f) = f(0) is a linear functional.
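That Taylor-series picture is easy to check numerically; a small sketch (the toy function g and the point are my own, using JAX), where the gradient supplies the linear term and the Hessian the quadratic one:

    import jax
    import jax.numpy as jnp

    # Toy scalar function g: R^2 -> R
    def g(x):
        return jnp.sin(x[0]) * jnp.exp(x[1])

    x = jnp.array([0.3, -0.2])
    h = jnp.array([1e-2, -2e-2])

    grad = jax.grad(g)(x)       # the "del" term
    hess = jax.hessian(g)(x)    # the "del (x) del" term

    taylor2 = g(x) + grad @ h + 0.5 * h @ hess @ h
    print(g(x + h), taylor2)    # agree up to O(|h|^3)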

tho2342i342342 3 days ago | parent [-]

∇⊗∇ would be more like ∂ᵢf · ∂ⱼf, not ∂ᵢ∂ⱼf

setopt 3 days ago | parent [-]

I disagree: if you apply it in the order (∇⊗∇) f then you should get ∂ᵢ∂ⱼ as elements of a rank-2 operator that is then applied to a function f. That is, presumably, what you mean by ∂ᵢ∂ⱼf.
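For what it's worth, the two readings really do produce different objects; a quick check with a toy g (my own example, using JAX):

    import jax
    import jax.numpy as jnp

    def g(x):
        return jnp.sin(x[0]) * jnp.exp(x[1])

    x = jnp.array([0.3, -0.2])

    grad = jax.grad(g)(x)
    outer_of_grads = jnp.outer(grad, grad)  # reading 1: (∂ᵢg)(∂ⱼg)
    second_partials = jax.hessian(g)(x)     # reading 2: ∂ᵢ∂ⱼg

    print(jnp.allclose(outer_of_grads, second_partials))  # False in general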

ndriscoll 3 days ago | parent | next [-]

I think it is actually bad notation. It was a long time ago, but I think the suggestive nod was basically pointing out how the entries of the matrix are arranged as "copies of del", and that this thing should be thought of as eating 2 infinitesimal vectors. I don't think he actually wrote that (again, with those prereqs it wouldn't have made sense to try), but afaik some people do use that notation.

I think if you tried to actually make sense of it, you'd expect that you plug in f twice (I think ∇ already secretly lives in R^n* ⊗ L(C^infty, C^infty), so you'd expect 2 copies of each of those). If you think of it as ∇: L(C^infty, L(R^n, C^infty)) then the composition ∇^2: L(C^infty, L(R^n, L(R^n, C^infty))) ≅ C^infty* ⊗ R^n* ⊗ R^n* ⊗ C^infty, so the types should work correctly (though the type of ∇ actually needs some generic parameters for composition to make sense), and you get that you only plug f in once.

Unfortunately ∇^2 is already taken by the Laplacian, so I suppose ∇⊗∇ would be "∇^2, but the one that makes a tensor", and it'll do as long as you don't try to type check it, which physicists won't.

imtringued 3 days ago | parent | prev | next [-]

You're confusing too many things.

The Hessian is defined as the matrix of second order partial derivatives of a scalar valued function. Therefore it will always give you a matrix.

What you're doing with the shape (m, n, n) isn't actually guaranteed at all, since the output of an arbitrary function can be a tensor of any shape; you can apply the Hessian to each scalar entry of that output to get another arbitrary tensor with two extra dimensions.

It's the Jacobian that is weird, since it is just a vector of gradients and therefore its partial derivative must also be a vector of Hessians.
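A sketch of that "two dimensions more" behaviour (the toy function with a (2, 3)-shaped output is my own illustration, using JAX):

    import jax
    import jax.numpy as jnp

    # Toy function whose output is a (2, 3) array rather than a vector
    def f(x):
        return jnp.outer(jnp.array([1.0, x[0]]), x) * jnp.sum(x**2)

    x = jnp.array([1.0, 2.0, 3.0])   # n = 3

    print(f(x).shape)                # (2, 3)
    print(jax.jacobian(f)(x).shape)  # (2, 3, 3)     one extra input axis
    print(jax.hessian(f)(x).shape)   # (2, 3, 3, 3)  two extra input axes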

mcabbott 3 days ago | parent | prev | next [-]

This doesn't really help with programming, but in physics it's traditional to use up- and down-stairs indices, which makes the distinction you want very clear.

If input x has components xⁿ, and output f(x) components fᵐ, then the Jacobian is ∂ₙfᵐ which has one index upstairs and one downstairs. The derivative has a downstairs index... because x is in the denominator of d/dx, roughly? If x had units seconds, then d/dx has units per second.

Whereas if g(x) is a number, the gradient is ∂ₙg, and the Hessian is ∂ₙ₁∂ₙ₂g with two downstairs indices. You might call this a (0,2) tensor, while the Jacobian is (1,1). Most of the matrices in ordinary linear algebra are (1,1) tensors.
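One numerical way to see the (0,2) vs (1,1) distinction: under a linear change of input coordinates x = A y, the Hessian picks up a factor of A on both of its (downstairs) slots, while the Jacobian picks it up only on its single input slot. A sketch with my own toy functions, using JAX:

    import jax
    import jax.numpy as jnp

    A = jnp.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0],
                   [1.0, 0.0, 1.0]])    # change of input coordinates x = A y

    def g(x):                           # scalar-valued: Hessian is (0,2)
        return jnp.sin(x[0]) * x[1] * x[2]

    def f(x):                           # vector-valued: Jacobian is (1,1)
        return jnp.array([x[0] * x[1], jnp.cos(x[2])])

    y = jnp.array([0.1, 0.2, 0.3])
    x = A @ y

    H_y = jax.hessian(lambda y: g(A @ y))(y)
    print(jnp.allclose(H_y, A.T @ jax.hessian(g)(x) @ A, atol=1e-5))  # True

    J_y = jax.jacobian(lambda y: f(A @ y))(y)
    print(jnp.allclose(J_y, jax.jacobian(f)(x) @ A, atol=1e-5))       # True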

flufluflufluffy 3 days ago | parent [-]

We always referred to them as super/sub-scripts. So like xₙ is read “x sub n”

Upstairs/downstairs is kinda cute tho xD

mcabbott 3 days ago | parent [-]

Covariant and contravariant indices would be the formal terms. I'm not really sure whether I've seen "upstairs" written down.

Sub/superscript... strike me as the typographical terms, not the meaning? Like $x_\mathrm{alice}$ is certainly a subscript, and footnote 2 is a superscript, but neither is an index.

xigoi 3 days ago | parent | prev | next [-]

I’ve been introduced to the Hessian in the context of finding the extrema of functions of multiple variables, where it does not make sense to consider arbitrary output dimensions (what is the minimum of a vector valued function?). In this context, it is also important to find the definiteness of the underlying quadratic form, which is easier if you treat it as a matrix so you can apply Sylvester’s criterion.
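For example, checking definiteness at a critical point via the leading principal minors of the Hessian (Sylvester's criterion); the function below is just a toy example (a sketch using JAX):

    import jax
    import jax.numpy as jnp

    # Toy function with a critical point at the origin
    def g(x):
        return 2 * x[0]**2 + x[0] * x[1] + x[1]**2

    x0 = jnp.array([0.0, 0.0])
    H = jax.hessian(g)(x0)

    # Sylvester's criterion: positive definite iff every leading
    # principal minor is positive
    minors = [float(jnp.linalg.det(H[:k, :k])) for k in range(1, 3)]
    print(minors)   # [4.0, 7.0] -> positive definite -> local minimum at x0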

cdavid 3 days ago | parent | prev | next [-]

I agree it is confusing, because starting with notation will confuse you. I personally don't like the partial derivative-first definition of those concepts, as it all sounds a bit arbitrary.

What made sense to me is to start from the definition of the derivative (the best linear approximation, in some sense), and then everything else is about how to represent it: vectors, matrices, etc. are all vectors in the appropriate vector space, and the derivative always has the same functional form.

E.g. you want the derivative of f(M)? Just write f(M+h) - f(M), and then look for the terms in h, h^2, etc. Apply the chain rule etc. for more complicated cases. This is IMO a much better way to learn about this.
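For instance, with the toy choice f(M) = M², expanding f(M+h) - f(M) = Mh + hM + h² shows the term linear in h, i.e. the derivative applied to h, is Mh + hM. A sketch checking that against automatic differentiation (using JAX's jvp):

    import jax
    import jax.numpy as jnp

    def f(M):                  # toy matrix-valued example: f(M) = M^2
        return M @ M

    M = jnp.array([[1.0, 2.0], [3.0, 4.0]])
    h = jnp.array([[0.1, 0.0], [0.2, -0.1]])

    # jvp gives the derivative of f at M applied to the direction h
    _, df_h = jax.jvp(f, (M,), (h,))
    print(jnp.allclose(df_h, M @ h + h @ M))   # True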

As for notation, you use vec/kronecker product for complicated cases: https://janmagnus.nl/papers/JRM093.pdf

kandel 3 days ago | parent | prev | next [-]

Well for me the Hessian is the second order derivative in the special case where the co-domain is of dim 1. It's just very easy to work with...

brantmv 3 days ago | parent | prev [-]

Mathematicians are afraid of higher order tensors because they are unruly monsters.

There's a whole workshop of useful matrix tools. Decompositions, spectral theory, etc. These tools really break down when you generalize them to k-tensors. Even basic concepts like rank become sticky. (Iirc, the set of 3-tensors of tensor rank ≤k is not even topologically closed in general. Terrifying.) If you hand me some random 5-tensor, it's quite difficult to begin to understand it without somehow turning it into a matrix first by flattening or slicing or whatever.
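For concreteness, "turning it into a matrix by flattening" can be as simple as an unfolding, i.e. keeping one axis and reshaping the rest away (a trivial sketch with an arbitrary tensor, using JAX numpy):

    import jax.numpy as jnp

    T = jnp.arange(24.0).reshape(2, 3, 4)    # an arbitrary 3-tensor

    # "Mode-1 unfolding": keep the first axis, flatten the others into columns,
    # giving an ordinary 2 x 12 matrix the usual linear algebra toolbox applies to
    print(T.reshape(2, -1).shape)            # (2, 12)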

Don't get me wrong. People work with these things. They do their best. But in general, mathematicians are afraid of higher order tensors. You should be too.