Remix.run Logo
brendanashworth 4 hours ago

Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

[1] https://shap.readthedocs.io/en/latest/

dwohnitmok 3 hours ago | parent [-]

SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.

SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.

It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the other that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers and the ways they compose can completely confuse SHAP.

It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.

I don't know the specifics of what this particular model's approach is.

But SHAP unfortunately does not work for LLMs at all.