Are you asking about open problems? Here are a couple:
There are strong arguments that deep learning generalization (and, in a slightly different sense, robustness) can be explained in terms of PAC-Bayes theory; see for example https://arxiv.org/abs/2503.02113
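To make that concrete, one standard PAC-Bayes bound (a McAllester/Maurer-style form, quoted here as a sketch rather than the specific bound used in the linked paper) says that with probability at least 1 − δ over an i.i.d. sample of size n, simultaneously for every posterior Q over weights,

$$ \mathbb{E}_{h \sim Q}[L(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat{L}(h)] \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}} $$

where L is the true risk, \hat{L} the empirical risk, and P the prior. The bound is only non-vacuous when the prior P already puts substantial mass near the kinds of solutions training actually finds, which is exactly why the prior (i.e. the implicit bias) matters so much below.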
This requires a strong understanding of the appropriate prior, which effectively quantifies the implicit bias of the training setup and architecture. Early analyses suggested the bias is towards minimal gradient norm (https://arxiv.org/pdf/2101.12176), but this is almost certainly not the case, e.g. (https://arxiv.org/abs/2005.06398). There are good empirical arguments that models are trained to achieve maximal compression (https://arxiv.org/abs/2211.13609), but we only know this in a vague sense.
How do deep learning training procedures and architectures implicitly exhibit this form of sparsity and compress information? Why is this more effective for some optimizers/architectures than others? Why would minimum-description-length circuits be so effective at mimicking human intelligence? How does this form of regularisation induce the exponents in the neural scaling laws (so we can figure out how best to optimize for those exponents)? If we don't know the implicit bias and how it relates to the training procedure, we're just doing alchemy and have effectively gotten lucky: change the standard approach even slightly and things easily break, and we only really know what to do by process of elimination.
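As a toy illustration of the "minimal gradient norm" hypothesis, here is a minimal PyTorch sketch (the function name, the penalty weight, and the placeholders model/x/y/loss_fn are mine, not from the linked papers) that makes the conjectured bias explicit as a regularizer. Comparing training with and without such an explicit penalty is one crude way to probe whether that bias is really what's doing the work.

```python
import torch

def loss_with_grad_norm_penalty(model, x, y, loss_fn, lam=1e-3):
    """Task loss plus an explicit penalty on the parameter-gradient norm,
    i.e. the conjectured implicit bias written out as an explicit regularizer."""
    loss = loss_fn(model(x), y)
    # create_graph=True so the penalty term itself is differentiable
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)
    return loss + lam * grad_norm_sq
```

If the implicit-bias story were the whole explanation, adding the penalty explicitly should change little; where it changes a lot, the story is incomplete.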
Then there are the LLM-specific questions related to in-context learning (https://arxiv.org/abs/2306.00297), which would help explain why next-token prediction with transformer architectures works so well. Does it implicitly act as an optimization procedure (with its own bias) over the context? What are the biases here? What is the problem being solved?
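To pin down what "implicitly acting as an optimization procedure" means, here is a minimal numpy sketch of the toy setup much of this literature studies: in-context linear regression, where each prompt is a fresh linear task shown as (x, y) pairs. A real experiment would compare a trained transformer's prediction at the query point against baselines like the ones below (function names, sizes, and the learning rate are mine, purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_icl_regression_prompt(d=8, n_ctx=16):
    """One toy in-context task: a fresh linear function, shown only via (x, y) pairs."""
    w = rng.normal(size=d)            # task weights, never shown to the model
    X = rng.normal(size=(n_ctx, d))
    y = X @ w
    x_query = rng.normal(size=d)
    return X, y, x_query, w

X, y, x_query, w_true = make_icl_regression_prompt()

# Baselines the in-context prediction is typically compared against:
# (a) full least squares on the context,
# (b) one gradient-descent step from zero weights on the context's squared loss,
#     which is roughly the kind of update the "implicit optimization" story
#     suggests attention layers can implement.
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
w0 = np.zeros(X.shape[1])
lr = 1.0 / len(X)
w_one_step = w0 - lr * X.T @ (X @ w0 - y)

print("true target    :", float(x_query @ w_true))
print("least squares  :", float(x_query @ w_ls))
print("one GD step    :", float(x_query @ w_one_step))
```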
These may sound specific, but they are just examples of approaches to answering the big question of "why does this standard procedure work so well, and what can we do to make it work better?". Without answers, we're just following the heuristics of what came before, with no guiding framework.
Post hoc, we can do a lot more with interpretability frameworks, linear probes, and the like. But without knowing the implicit biases, we only get a local picture, which doesn't help us understand what a model is likely to do before we run it. We need global information.
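For reference, a linear probe in this sense is just a linear classifier fit on hidden activations to check whether some property is linearly decodable from them. A minimal sketch, with random stand-ins for the activations and labels so it runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins: in practice `acts` would be hidden activations of shape
# (n_examples, d_hidden) collected from some layer of a trained model,
# and `labels` a property of interest (e.g. sentiment, truthfulness).
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))
labels = (acts[:, :8].sum(axis=1) > 0).astype(int)  # toy linearly-decodable "feature"

A_tr, A_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(A_tr, y_tr)
print("probe accuracy:", probe.score(A_te, y_te))
```

The point of the paragraph above is that this only tells you about the model you already trained and ran: a local, after-the-fact picture rather than a predictive, global one.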
These are just a few, and they don't include the philosophical and theory-of-mind questions, or others from neuroscience...