Weight-sparse transformers have interpretable circuits [pdf]

lambdaone 17 minutes ago | parent | next [-]

I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?

	▲	radarsat1 9 minutes ago \| parent [-]
		> fuzzifying logical computation? Isn't that basically what the sigmoid operator does? Or more in the direction of averaging many logical computations, we have random forests.

▲

oli5679 20 minutes ago | parent | prev | next [-]

This ties directly into the superposition theory.

It is believed dense models cram many features into shared weights, making circuits hard to interpret.

Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.

▲

peter_d_sherman 2 hours ago | parent | prev [-]

>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.

Sparse circuits are defined as a set of nodes connected by edges."

...which could also be considered/viewed as Graphs...

(Then from earlier in the paper):

>"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.

And (jumping around a bit more in the paper):

>"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."

A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...)

Anyway, great paper!