scellus | 4 days ago
Priors on parameters are not an issue. In models at scale, priors are just computationally convenient shrinkage; what works is found empirically and canonized into practice. Encoding prior knowledge of the problem at hand through parameter priors does not really happen, except in some vague sense ("I think most predictors are irrelevant, so make it sparse with a Cauchy/horseshoe/whatever prior").

The important thing in Bayesian modelling (statistical or ML) in general is the gain in flexibility: model structures that would otherwise be hard or impossible, such as latent states, hierarchies, etc. In Bayesian NNs the main advantages would be uncertainty quantification (UQ), finding good optima, and partly avoiding overfitting. These do apply in some cases of simple NNs. Mostly, however, especially with larger conventional models (not speaking of normalizing flows and such here), explicit Bayes is not feasible. Instead, people use approximate point estimates with tricks:

(1) UQ is taken care of by post-hoc calibration.

(2) Stochastic gradient descent actually searches for regions of large posterior mass, much like a variational approximation would, so it is kind of Bayes.

(3) And those priors: dropout is commonplace and has a Bayesian interpretation, and L2 regularization, a.k.a. Gaussian priors, is frequent too.

So Bayes is there in practice, just not in a neat, pure form but as a collection of practical hacks.
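To make that "vague sense" concrete: a horseshoe-style sparsity prior on regression coefficients is about as far as parameter priors usually go. A minimal NumPyro sketch, assuming a plain linear regression; the function name and unit scales are illustrative, not a recommendation:

    import jax.numpy as jnp
    import numpyro
    import numpyro.distributions as dist

    def sparse_regression(X, y=None):
        d = X.shape[1]
        # "Most predictors are irrelevant": heavy-tailed global/local shrinkage.
        tau = numpyro.sample("tau", dist.HalfCauchy(1.0))           # global scale
        lam = numpyro.sample("lam", dist.HalfCauchy(jnp.ones(d)))   # per-coefficient scales
        beta = numpyro.sample("beta", dist.Normal(jnp.zeros(d), tau * lam))
        sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
        # Likelihood; run with NUTS/MCMC or a variational approximation.
        numpyro.sample("y", dist.Normal(X @ beta, sigma), obs=y)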
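On (1), the usual post-hoc calibration is temperature scaling: learn one scalar on a held-out set and divide the logits by it. A minimal PyTorch sketch, assuming you can collect validation logits; the class and function names are just illustrative:

    import torch
    import torch.nn as nn

    class TemperatureScaler(nn.Module):
        """Single-parameter post-hoc calibration: divide logits by a learned temperature."""
        def __init__(self):
            super().__init__()
            # Parameterize on the log scale so the temperature stays positive.
            self.log_temp = nn.Parameter(torch.zeros(1))

        def forward(self, logits):
            return logits / self.log_temp.exp()

    def fit_temperature(scaler, val_logits, val_labels):
        """Fit the temperature on held-out logits by minimizing NLL."""
        opt = torch.optim.LBFGS([scaler.log_temp], lr=0.1, max_iter=100)
        nll = nn.CrossEntropyLoss()

        def closure():
            opt.zero_grad()
            loss = nll(scaler(val_logits), val_labels)
            loss.backward()
            return loss

        opt.step(closure)
        return scaler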
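On the dropout part of (3), the Bayesian reading (MC dropout) is that leaving dropout active at prediction time and averaging a handful of stochastic passes gives a crude posterior-predictive mean and spread. A sketch under that assumption; the model and the sample count are placeholders:

    import torch

    def mc_dropout_predict(model, x, n_samples=50):
        """Keep dropout active at test time and average stochastic forward passes."""
        model.train()  # leaves nn.Dropout stochastic; be careful if the model has BatchNorm
        with torch.no_grad():
            probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
        model.eval()
        return probs.mean(dim=0), probs.var(dim=0)  # predictive mean and a spread estimate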
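And the L2/Gaussian correspondence is just that adding (lambda/2)*||w||^2 to the loss is MAP estimation under an independent N(0, 1/lambda) prior on each weight; with plain SGD that is literally the weight_decay argument. A two-line sketch with arbitrary layer sizes and decay:

    import torch

    # Plain SGD's weight_decay adds lambda * w to the gradient, i.e. the gradient of a
    # Gaussian log-prior. (AdamW decouples the decay, so the prior reading is looser there.)
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)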