SpaceManNabs 3 days ago

Why do you think they perform so poorly?

dccsillag 3 days ago | parent [-]

Theory-wise, I'm not convinced that the models have good approximation properties (the Kolmogorov-Arnold / Kolmogorov Superposition Theorem they base themselves on has quite a bit of nuance), and the optimization problem might be a bit tricky. I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts at this that I'm aware of are just band-aids (along the lines of feature engineering).

In practice, I've personally run some benchmarks on a collection of datasets I had lying around. The results were generally abysmal, with the method only matching simple baselines on a few datasets.

Finally, the original paper is very weird, and reads more like a marketing piece. The theory, which is touted throughout the paper, is very weak; the actual algorithm is not explained well enough there; and the experiments are lacking. In particular, I find it telling that they do not include, and even go out of their way to ignore, important baselines such as boosted trees, which are the state-of-the-art solution to the problem that they intended to solve (and even work very well in cases where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).

SpaceManNabs a day ago | parent [-]

Thanks for the detailed answer. So I guess the main issue with KANs is that they don't work as well. I wonder if that shortfall could be because we haven't spent as much time setting up KANs for learning as we have for things like MLPs. I am not surprised, though, that KANs don't beat boosted trees and such. MLPs don't really either.

Just one follow-up question:

> I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts at this that I'm aware of are just band-aids (along the lines of feature engineering)

A lot of the way we introduce inductive biases in the traditional network setting (where activations sit on the nodes instead of on the edges, as in KANs) is by using structured architectures, like convolutions or transformers, or by setting up particular losses and optimizations, like in equivariant networks. Can't we do the same thing for KANs?
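To make the node-vs-edge distinction concrete, here is a rough sketch in plain Python. It is not the paper's implementation: the names `mlp_layer`, `edge_fn`, and `kan_layer` are made up for illustration, and the edge functions use Gaussian bumps as an assumed stand-in for whatever learnable univariate basis (e.g. splines) a real KAN would use.

```python
import math

def mlp_layer(x, W, b):
    """Standard layer: linear mix of inputs, then one fixed nonlinearity per node."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
        for row, bj in zip(W, b)
    ]

def edge_fn(xi, coef, centers, width=0.5):
    """Learnable univariate function on one edge.

    Assumption: parametrized as a weighted sum of Gaussian bumps; `coef`
    holds the trainable weights, `centers` the fixed bump locations.
    """
    return sum(c * math.exp(-((xi - m) / width) ** 2) for c, m in zip(coef, centers))

def kan_layer(x, coef, centers):
    """KAN-style layer: every edge (j, i) applies its own function to x[i];
    node j only sums the results. coef[j][i] is the weight list for edge (j, i)."""
    return [
        sum(edge_fn(xi, coef[j][i], centers) for i, xi in enumerate(x))
        for j in range(len(coef))
    ]

if __name__ == "__main__":
    x = [0.3, -1.2]
    centers = [-1.0, 0.0, 1.0]
    W = [[0.5, -0.4], [1.0, 0.2]]                    # 2 outputs, 2 inputs
    coef = [[[0.1, 0.2, 0.3], [0.0, -0.5, 0.4]],     # edges into output 0
            [[1.0, 0.0, -1.0], [0.3, 0.3, 0.3]]]     # edges into output 1
    print(mlp_layer(x, W, [0.0, 0.0]))
    print(kan_layer(x, coef, centers))
```

In this framing, weight sharing (as in a convolution) would amount to reusing the same `coef[j][i]` across positions; whether that actually works well in practice is exactly the open question here.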