| ▲ | hodgehog11 6 hours ago |
| As someone who works in the area, this provides a decent summary of the most popular research topics. The most useful and impressive part is the set of open problems at the end, which covers just about all of the main research directions in the field. The skepticism I'm seeing in the comments really highlights how little of this work is trickling down to the public, which is very sad to see. While the theory offers few mathematical tools for inferring optimal network design yet (mostly because trying things empirically is often faster than working through the theory, so explanations tend to be inferred retroactively), the question "why do neural networks work better than other models?" is getting pretty close to a solid answer. Problem is, that was never the question most people were really interested in, so the field now has to figure out what questions to ask next. |
|
| ▲ | chadcmulligan 4 hours ago | parent | next [-] |
| "why do neural networks work better than other models?" That sounds really interesting - any references (for a non specialist)? |
| |
| ▲ | andbberger 2 hours ago | parent [-] | | https://en.wikipedia.org/wiki/Universal_approximation_theore... the better question is why does gradient descent work for them | | |
| ▲ | jmalicki 2 hours ago | parent | next [-] | | The properties that the universal approximation theorem proves are not unique to neural networks. Many other model classes, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient boosted decision trees, etc., are also universal approximators (though proven via different theorems, of course). So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models. | |
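A minimal sketch of that point, assuming numpy and scikit-learn are available (the model choices and settings below are illustrative, not from the thread): a small ReLU network and an RBF kernel method both approximate the same 1D target, so universal approximation alone cannot be what separates them.

```python
# Sketch: two very different universal approximators fit the same target.
# Settings are illustrative only.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)   # noisy 1D target

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
y_true = np.sin(2 * X_test[:, 0])

# RBF kernel ridge regression: one of the non-NN universal approximators
krr = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0).fit(X, y)

# A small two-layer ReLU network trained by a gradient-based optimizer
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=5000, random_state=0).fit(X, y)

for name, model in [("RBF kernel ridge", krr), ("two-layer MLP", mlp)]:
    mse = np.mean((model.predict(X_test) - y_true) ** 2)
    print(f"{name}: test MSE ~ {mse:.4f}")   # both should be small
```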
| ▲ | hodgehog11 an hour ago | parent [-] | | Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression. |
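A toy illustration of implicit regularization in the simplest possible setting (the setup is assumed, not from the thread): gradient descent on an overparameterized linear least-squares problem, started from zero, converges not to an arbitrary interpolating solution but to the minimum-norm one, a bias supplied by the optimizer rather than by the loss itself.

```python
# Sketch: implicit regularization in under-determined least squares.
# Gradient descent from zero init converges to the minimum-norm interpolant.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # far fewer samples than parameters
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w = np.zeros(d)                         # the zero initialization matters here
lr = 1.0 / np.linalg.norm(A, ord=2) ** 2
for _ in range(5000):
    w -= lr * A.T @ (A @ w - b)         # plain gradient descent on 0.5*||Aw - b||^2

w_min_norm = np.linalg.pinv(A) @ b      # explicit minimum-norm interpolant

print("training residual:", np.linalg.norm(A @ w - b))                   # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```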
| |
| ▲ | fc417fc802 2 hours ago | parent | prev [-] | | I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top. | | |
| ▲ | hodgehog11 an hour ago | parent [-] | | It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there are a lot of things one can do to help ensure it works. | |
| ▲ | anvuong 35 minutes ago | parent | next [-] | | A funny thing is, in very high-dimensional spaces, with millions or billions of parameters, the chance that you get stuck in a local minimum is extremely small. Think about it like this: to be stuck in a local minimum in 2D, you only need 2 gradient components to be zero; in higher dimensions you'd need every single one of them, millions upon millions of them, to all be zero. If even a single gradient component is non-zero, SGD can get you out of it. And SGD is a stochastic walk on that manifold, not entirely random but rather noisy, so the chance that you somehow walk into a local minimum is very, very low, unless it is a "really good" local minimum, in the sense that it dominates all other local minima in its neighborhood. | |
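A rough numerical sketch of that intuition, with random symmetric matrices standing in for Hessians at critical points (a strong simplifying assumption; real loss Hessians are not this simple): the fraction of such "critical points" whose curvature is positive in every direction, i.e. genuine local minima rather than saddles, collapses rapidly as the dimension grows.

```python
# Sketch: in high dimensions, a "random" critical point is almost never a minimum.
import numpy as np

rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=2000):
    """Fraction of random symmetric 'Hessians' with all-positive eigenvalues."""
    hits = 0
    for _ in range(trials):
        M = rng.normal(size=(dim, dim))
        H = (M + M.T) / 2                        # random symmetric stand-in Hessian
        if np.all(np.linalg.eigvalsh(H) > 0):    # positive curvature in every direction
            hits += 1
    return hits / trials

for dim in (1, 2, 4, 8, 16):
    print(dim, fraction_of_minima(dim))
# Expected: ~0.5 at dim=1, then a rapid collapse toward 0 as dim grows,
# i.e. almost every high-dimensional critical point is a saddle, not a minimum.
```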
| ▲ | appplication an hour ago | parent | prev [-] | | Not a mathematician so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive number of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need a “well” at least as large as your step size to reasonably capture your descent. E.g. you could land perfectly on a local minimum, but you won’t stay there unless your step size is minute or the minimum is quite substantial. |
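That intuition is easy to check in one dimension. A toy sketch (the specific loss below is made up for illustration): on a broad bowl with fast wiggles on top, a tiny step size settles into the nearest narrow well, while a larger step size cannot be captured by wells narrower than its own scale and ends up near the broad basin.

```python
# Sketch: step size vs. width of local minima, in 1D.
# f(x) = x^2 + 0.5*sin(30x): a broad bowl with many narrow wells on top.
import numpy as np

def f(x):  return x**2 + 0.5 * np.sin(30 * x)
def df(x): return 2 * x + 15 * np.cos(30 * x)

def run_gd(step, x0=3.0, iters=5000):
    x, best = x0, np.inf
    for t in range(iters):
        x -= step * df(x)
        if t >= iters - 1000:          # track where it spends its late iterations
            best = min(best, f(x))
    return x, best

for step in (1e-3, 2e-2):
    x, best = run_gd(step)
    print(f"step={step}: final x ~ {x:+.2f}, best late f ~ {best:.2f}")
# Expected: the tiny step stays trapped in a narrow well near x ~ 3 (f ~ 9),
# while the larger step skips over wells narrower than its own scale and
# ends up rattling around the broad basin near 0 (f well under 1).
```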
|
|
|
|
|
| ▲ | cookiengineer 2 hours ago | parent | prev | next [-] |
| In my opinion, current research should focus on revisiting older concepts to figure out whether they can be applied to transformers. Transformers are superior "database" encodings, as the hype around LLMs shows, but there have been promising ML models that focused on memory components for their niche use cases; those could become promising again if we could make them work with attention matrices and/or apply the frequency projection idea to their neuron weights. The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims for the DNC's memory-related parts. Back then the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access/etc. gates now. [1] a fairly good implementation I found: https://github.com/joergfranke/ADNC |
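For anyone unfamiliar with the overlap being described: DNC read heads and (single-query) dot-product attention both reduce to content-based addressing over a memory of vectors. A minimal sketch of that shared core, purely illustrative and unrelated to the ADNC repo linked above:

```python
# Sketch: content-based memory read, the shared core of DNC read heads
# and single-query attention. Purely illustrative.
import numpy as np

def content_read(memory: np.ndarray, key: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """memory: (N, D) slots; key: (D,) query; beta: sharpness of addressing."""
    # Cosine similarity between the key and each memory slot (DNC-style);
    # swapping in a plain dot product gives the attention-style variant.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = np.exp(beta * sims)
    weights /= weights.sum()            # softmax -> soft read weights over slots
    return weights @ memory             # weighted sum of slots = read vector

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))             # 8 memory slots of width 4
q = M[3] + 0.05 * rng.normal(size=4)    # query close to slot 3
print(content_read(M, q, beta=10.0))    # ~ reproduces slot 3
```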
|
| ▲ | mathisfun123 2 hours ago | parent | prev [-] |
| > why do neural networks work better than other models The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters. |
| |
| ▲ | hodgehog11 2 hours ago | parent | next [-] | | No it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If this were true, then models with "infinitely many" parameters would be amazing. What about just training a gigantic two-layer network? There is a huge amount of work trying to engineer training procedures that work well. The actual reason is the complex biases that arise from the interaction of network architectures and optimizers, and that persist in the regime where data scales proportionally with model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve. | |
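On the scaling-law point: the empirical claim is that loss falls off roughly as a power law in model or data size, L(N) ≈ a · N^(-α) (ignoring the irreducible-loss term). A tiny sketch of how such an exponent is estimated, on synthetic numbers generated purely for illustration, not real measurements:

```python
# Sketch: fitting a scaling-law exponent on synthetic data.
# L(N) = a * N**(-alpha) plus noise; recover alpha by a log-log linear fit.
import numpy as np

rng = np.random.default_rng(0)
alpha_true, a = 0.35, 5.0
N = np.logspace(5, 9, 20)                                           # synthetic model sizes
L = a * N**(-alpha_true) * np.exp(0.02 * rng.normal(size=N.size))   # synthetic losses

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print("estimated exponent alpha ~", -slope)                         # ~ 0.35
```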
| ▲ | skydhash 2 hours ago | parent [-] | | > The actual reason is the complex biases that arise from the interaction of network architectures and optimizers, and that persist in the regime where data scales proportionally with model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve. That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. It works for linear regression and it works just as well for neural networks. The key thing here is approximation. | |
| |
| ▲ | tacet 2 hours ago | parent | prev [-] | | Also the massive amount of human work done on them that wasn't done before. Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it. |
|