srean 15 hours ago

People go all dopey-eyed about "frequency space"; that's a red herring. The takeaway should be that a problem-centric coordinate system is enormously helpful.

After all, what Copernicus showed is that the mind-bogglingly complicated motion of the planets becomes a whole lot simpler if you change the coordinate system.

The Ptolemaic model of epicycles was an ad hoc form of Fourier analysis - decomposing periodic motion into circles riding on circles.

Back to frequencies: there is nothing obviously frequency-like in real-space Laplace transforms*. The real insight is that differentiation and integration become simple if the coordinates used are exponential functions, because exponential functions remain (scaled) exponential when passed through such operations.
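Written out, the eigenfunction property and what it buys for the Laplace transform of a derivative (standard identities):

    \frac{d}{dt} e^{st} = s\, e^{st}, \qquad \int e^{st}\, dt = \frac{1}{s}\, e^{st} + C, \qquad \mathcal{L}\{f'\}(s) = s\, F(s) - f(0)

So in exponential coordinates differentiation becomes multiplication by s, and a linear ODE becomes algebra.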

For digital signals, what helps is the Walsh-Hadamard basis. Its functions are not like frequencies, and they are not at all like square-wave analogues of sinusoids. People call the resulting domain sequency space, a well-justified pun.
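A minimal sketch of what that basis looks like in code (Sylvester construction; the helper name and test signal are my own):

    import numpy as np

    def hadamard(n):
        # Sylvester construction: H_{2m} = [[H_m, H_m], [H_m, -H_m]], n a power of 2.
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    H = hadamard(8)                     # rows are +/-1 Walsh functions, mutually orthogonal
    x = np.array([1., 1., -1., -1., 1., 1., -1., -1.])
    coeffs = H @ x / 8                  # forward Walsh-Hadamard transform
    print(coeffs)                       # a single nonzero entry: x is itself one Walsh function
    print(H @ coeffs)                   # inverse transform recovers x, since H @ H = 8 * I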

My suspicion is that we are in a Ptolemaic state as far as GPT-like models are concerned. We will eventually understand them better once we figure out the right coordinate system in which to think about their dynamics.

* There is a connection, though, through the exponential form of complex numbers, or more prosaically: when multiplying rotation matrices, the angles combine additively. So angles and logarithms share a certain unity of character.
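In symbols:

    R(\theta_1)\, R(\theta_2) = R(\theta_1 + \theta_2), \qquad e^{a}\, e^{b} = e^{a + b}

Composing rotations adds angles exactly the way multiplying exponentials adds exponents, i.e. logarithms.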

madhadron 13 hours ago | parent | next [-]

All these transforms are switching to an eigenbasis of some differential operator (one that usually corresponds to a differential equation of interest): spherical harmonics, Bessel and Hankel functions (the radial analogues of sines/cosines and complex exponentials, respectively), and so on.
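A standard example of that switch: expand the heat equation in the Fourier eigenbasis of \partial_x^2 and it decouples into one trivial ODE per mode:

    \partial_t u = \partial_x^2 u, \quad u(x,t) = \sum_k \hat{u}_k(t)\, e^{ikx} \;\Longrightarrow\; \frac{d\hat{u}_k}{dt} = -k^2\, \hat{u}_k, \quad \hat{u}_k(t) = \hat{u}_k(0)\, e^{-k^2 t}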

The next big jumps were to collections of functions not parameterized by subsets of R^n. Wavelets, for example, use a tree-shaped parameter space.

There’s a whole interesting area of overcomplete basis sets that I have been meaning to look into, where you give up orthogonality and all its nice properties in exchange for having multiple options for adapting better to different signal characteristics.

I don’t think these transforms are going to be relevant to understanding neural nets, though. Neural nets are, by their nature, doing something with nonlinear structures in high dimensions that are not smoothly extended across their domain, which is the opposite of the problem all our current approaches to functional analysis deal with.

srean 11 hours ago | parent | next [-]

You may well be right about neural networks. Sometimes models that seem nonlinear turn linear if those nonlinearities are pushed into the basis functions, so one can still hope.

For GPT-like models, I see sentences as trajectories in the embedding space. These trajectories look quite complicated, with nothing obvious from a geometric standpoint. My hope is that if we get the coordinate system right, we may see something more intelligible going on.

This is just a hope, a mental bias. I do not have any solid argument for why it should be as I describe.

nihzm 9 hours ago | parent | next [-]

> Sometimes models that seem nonlinear turn linear if those nonlinearities are pushed into the basis functions, so one can still hope.

That idea was pushed to its limit by Koopman operator theory. The argument sounds quite good at first, but unfortunately it can’t really work in all cases in its current formulation [1].

[1]: https://arxiv.org/abs/2407.08177
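To make the lifting idea concrete, here is the classic polynomial example (my choice of illustration, not taken from [1]): the nonlinear system x1' = mu*x1, x2' = lambda*(x2 - x1^2) becomes exactly linear in the lifted coordinates (x1, x2, x1^2):

    \frac{d}{dt} \begin{pmatrix} x_1 \\ x_2 \\ x_1^2 \end{pmatrix}
    = \begin{pmatrix} \mu & 0 & 0 \\ 0 & \lambda & -\lambda \\ 0 & 0 & 2\mu \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_1^2 \end{pmatrix}

For a general system, though, the set of observables you would need may be infinite or unknown, which is where the difficulty mentioned downthread comes in.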

srean 8 hours ago | parent [-]

Quite so. Quite so indeed.

We know that under benign conditions an infinite-dimensional basis must exist, but finding it from finite samples is very non-trivial; we don't know how to do it in the general case.

madhadron 6 hours ago | parent | prev [-]

I’m not sure what you mean by a change of basis making a nonlinear system linear. A linear system is one where solutions add as elements of a vector space. That’s true no matter what basis you express it in.

fc417fc802 13 hours ago | parent | prev [-]

Note that I'm not great at math so it's possible I've entirely misunderstood you.

Here's an example of directly leveraging a transform to optimize the training process. ( https://arxiv.org/abs/2410.21265 )

And here are two examples that apply geometry to neural nets more generally. ( https://arxiv.org/abs/2506.13018 ) ( https://arxiv.org/abs/2309.16512 )

nihzm 12 hours ago | parent | next [-]

From the abstract and from skimming a few sections of the first paper, imho it is not really the same thing. The paper moves the loss gradient to the tangent dual space where the weights reside, for better performance in gradient descent, but as far as I understand neither the loss function nor the neural net is analyzed in a new way.

The Fourier and wavelet transforms are different: they are unitary operators (their basis functions form an orthogonal basis) on a space of functions (not on a finite-dimensional vector space of weights that parametrize a net), and they simplify some usually hard operators, such as derivatives and integrals, by reducing them to multiplications and divisions or to a sparse algebra.

So in a certain sense these methods are looking at projections, which are unhelpful when thinking about NN weights since they are all mixed with each other in a very non-linear way.
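As a small numerical illustration of the "derivatives become multiplications" point above (numpy, my own toy example):

    import numpy as np

    # Differentiate a periodic signal by multiplying its Fourier coefficients by i*k.
    n = 128
    x = np.linspace(0, 2 * np.pi, n, endpoint=False)
    f = np.sin(3 * x)

    k = 2 * np.pi * np.fft.fftfreq(n, d=2 * np.pi / n)   # integer wavenumbers 0, 1, ..., -1
    df = np.fft.ifft(1j * k * np.fft.fft(f)).real        # derivative via multiplication in Fourier space

    print(np.max(np.abs(df - 3 * np.cos(3 * x))))        # ~1e-13; matches d/dx sin(3x) = 3 cos(3x)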

srean 11 hours ago | parent | prev [-]

Thanks a bunch for the references. From the abstracts these use a different idea from what Fourier analysis is about, but they should nonetheless be a very interesting read.

anamax 3 hours ago | parent | prev | next [-]

> My suspicion is that we are in a Ptolemaic state as far as GPT-like models are concerned. We will eventually understand them better once we figure out the right coordinate system in which to think about their dynamics.

Most deep learning systems are learned matrices that are multiplied by "problem-instance" data matrices to produce a prediction matrix. The time to do said matrix-multiplication is data-independent (assuming that the time to do multiply-adds is data-independent).

If you multiply both sides by the inverse of the learned matrix, you get an equation where finding the prediction matrix is a solving problem, and the time to solve is data-dependent.

Interestingly enough, that time is sort-of proportional to the difficulty of the problem for said data.
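A rough sketch of that contrast (my own toy illustration, not Achler's formulation): the forward multiply costs the same for every input, while an iterative solve against the same matrix takes a data- and conditioning-dependent number of steps.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 256
    A = rng.standard_normal((n, n))
    W = A @ A.T / n + np.eye(n)        # stand-in "learned" matrix (symmetric PD for simplicity)
    x = rng.standard_normal(n)         # problem-instance data

    y = W @ x                          # forward view: fixed cost, independent of the data

    # Inverse view: recover x by solving W z = y iteratively.
    tau = 1.0 / np.linalg.norm(W, 2)   # safe step size from the spectral norm
    z = np.zeros(n)
    steps = 0
    while np.linalg.norm(y - W @ z) > 1e-6 * np.linalg.norm(y):
        z = z + tau * (y - W @ z)      # Richardson iteration
        steps += 1
    print(steps, np.allclose(z, x, atol=1e-4))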

Perhaps more interesting is that the inverse matrix seems to have row artifacts that look like things in the training data.

These observations are due to Tsvi Achler.

RossBencina 3 hours ago | parent | prev | next [-]

> exponential functions remain (scaled) exponential when passed through such operations.

See also: eigenvalue, differential operator, diagonalisation, modal analysis

alexlesuper 15 hours ago | parent | prev | next [-]

I feel like this is the way we should have been taught Fourier and Laplace transforms in my DSP class, not just blindly applying formulas and equations.

patentatt 14 hours ago | parent [-]

I’d argue that most if not all of the math I learned in school could be distilled down to analyzing problems in the correct coordinate system or domain! The actual manipulation isn’t that esoteric once you’re in the right paradigm. But those professors never explained things at that higher theoretical level; all I remember is the nitty-gritty of implementation. What a shame. I’m sure there are higher levels of mathematics that go beyond my simplistic understanding, but I’d argue this view is enough to get one through the full undergraduate sequence of (electrical) engineering, physics, and calculus.

Xcelerate 13 hours ago | parent | prev [-]

It’s kind of intriguing that predicting the future state of any quantum system becomes almost trivial—assuming you can diagonalize the Hamiltonian. But good luck with that in general. (In other words, a “simple” reference frame always exists via unitary conjugation, but finding it is very difficult.)
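That is, once H = V \Lambda V^\dagger is in hand, time evolution is just a phase per eigenstate:

    e^{-iHt/\hbar} = V\, e^{-i\Lambda t/\hbar}\, V^\dagger, \qquad |\psi(t)\rangle = \sum_n c_n\, e^{-iE_n t/\hbar}\, |n\rangle, \quad c_n = \langle n | \psi(0) \rangle

All the work is in finding V.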

srean 12 hours ago | parent [-]

Indeed.

It's disconcerting at times, the scope of finite- and infinite-dimensional linear algebra, especially when done in a convenient basis.