peter_d_sherman 2 hours ago
>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits. Sparse circuits are defined as a set of nodes connected by edges." ...which could also be considered/viewed as Graphs... (Then from earlier in the paper): >"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. And (jumping around a bit more in the paper): >"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network." A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...) Anyway, great paper! | ||