datastoat 4 days ago

I like Bayesian inference for few-parameter models where I have solid grounds for choosing my priors. For neural networks, I like to ask people "what's your prior for ReLU versus LeakyReLU versus sigmoid?" and I've never gotten a convincing answer.

stormfather 4 days ago | parent | next [-]

I choose LeakyReLU vs ReLU depending on whether it's an odd day of the week, LeakyReLU being slightly favored on odd days because it's aesthetically nicer that gradients propagate through negative inputs, though I can't discern a difference. I choose sigmoid if I want to waste compute to remind myself that it converges slowly due to vanishing gradients at extreme activation levels. So it's empiricism retroactively justified by some mathematical common sense that lets me feel good about the choices. Kind of like aerodynamics.
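For concreteness, a quick numpy check of that vanishing-gradient point (the sample inputs and the 0.01 leak slope are just illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # peaks at 0.25, shrinks fast away from zero

    def leaky_relu_grad(x, alpha=0.01):
        return np.where(x > 0, 1.0, alpha)

    x = np.array([-10.0, -2.0, 2.0, 10.0])
    print(sigmoid_grad(x))       # ~[4.5e-05, 0.105, 0.105, 4.5e-05]
    print(leaky_relu_grad(x))    # [0.01, 0.01, 1.0, 1.0]

At extreme activations the sigmoid gradient is essentially zero, while the (Leaky)ReLU slope stays constant on either side.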

duvenaud 4 days ago | parent | prev | next [-]

I agree choosing priors is hard, but choosing ReLU versus LeakyReLU versus sigmoid seems like a problem with using neural nets in general, not Bayesian neural nets in particular. Am I misunderstanding?

pkoird 4 days ago | parent | prev | next [-]

Kolmogorov Arnold nets might have an answer for you!

dccsillag 4 days ago | parent | next [-]

Ah, Kolmogorov Arnold Networks. Perhaps the only model I have ever tried that fairly often managed to get AUCs below 0.5 in my tabular ML benchmarks. It even managed a frankly disturbing 0.33, where pretty much any other method (including linear regression, IIRC) would get >= 0.99!

SpaceManNabs 3 days ago | parent [-]

Why do you think they perform so poorly?

dccsillag 3 days ago | parent [-]

Theory-wise, I'm not convinced that the models have good approximation properties (the Kolmogorov-Arnold / Kolmogorov Superposition Theorem they base themselves on has quite a bit of nuance), and the optimization problem might be a bit tricky. I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering).

In practice, I've personally run some benchmarks on a collection of datasets I had lying around. The results were generally abysmal, with the method only matching simple baselines on a few datasets.

Finally, the original paper is very weird and reads more like a marketing piece. The theory, which is touted throughout the paper, is very weak, the actual algorithm is not explained sufficiently well, and the experiments are lacking. In particular, I find it telling that they omit, and even go out of their way to ignore, important baselines such as boosted trees, which are the state-of-the-art solution to the problem they intended to solve (and which even work very well in cases where they claim that both KANs and MLPs perform badly, e.g. in high dimensions).

SpaceManNabs a day ago | parent [-]

Thanks for the detailed answer. So I guess the main issue with KANs is that they don't work as well. I wonder if that shortfall could be because we haven't spent as much time figuring out how to train KANs as we have for things like MLPs. I am not surprised, though, that KANs don't beat boosted trees and such. MLPs don't really either.

Only one follow up question:

> I also can't see how to incorporate inductive biases other than the standard R^n / tabular regression one, and the existing attempts on this that I'm aware of are just band-aids (along the lines of feature engineering)

A lot of the way we introduce inductive biases in the traditional network setting (activations are on the node instead of on the edge, as in KANs) is by using graph-based architectures, like convolutions or transformers, or by setting up particular losses and optimizations, as in equivariant networks. Can't we do the same thing for KANs?

jwuphysics 4 days ago | parent | prev [-]

Could you say a bit more about how so?

pkoird 4 days ago | parent [-]

KANs have learnable activations based on splines parameterized by a few variables. You can specify a prior over those variables, effectively establishing a prior over your activation function.
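Roughly, something like this (a minimal numpy sketch; the uniform grid, hat basis, and Gaussian prior are illustrative assumptions, not the exact B-spline parameterization from the KAN paper):

    import numpy as np

    # KAN-style edge activation: phi(x) = sum_j c_j * B_j(x),
    # with B_j piecewise-linear "hat" basis functions on a fixed grid.
    grid = np.linspace(-2, 2, 9)                  # knot locations (assumed)

    def hat_basis(x, grid):
        # shape (len(x), len(grid)): hat function centered at each knot
        h = grid[1] - grid[0]
        return np.clip(1.0 - np.abs(x[:, None] - grid[None, :]) / h, 0.0, 1.0)

    def phi(x, coeffs):
        return hat_basis(x, grid) @ coeffs

    # A prior over the activation is just a prior over its few coefficients.
    rng = np.random.default_rng(0)
    prior_std = 1.0                               # hyperparameter you'd still have to justify
    x = np.linspace(-2, 2, 200)
    samples = [phi(x, rng.normal(0.0, prior_std, size=grid.size)) for _ in range(5)]
    # each element of `samples` is one activation function drawn from the prior

Each draw of the coefficients is a different activation shape, so putting a distribution on those few variables is literally putting a prior over the activation function.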

salty_biscuits 4 days ago | parent | prev [-]

I'm sure there is a way of interpreting a ReLU as a sparsity prior on the layer.