Show HN: Steerling-8B, a language model that can explain any token it generates (guidelabs.ai)
83 points by adebayoj 5 hours ago | 11 comments
gormen 41 minutes ago
Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall. | ||||||||||||||||||||||||||
ottah 33 minutes ago
It's a neat party trick, but explainability isn't a solution to any AI safety issue I care about. It's a distraction from the real problems, which are everything else around the model: the inflexible bureaucratic systems that make it hard to exercise rights and that deflect accountability.
umairnadeem123 26 minutes ago
The practical value here is for regulated domains. In healthcare and finance you often can't deploy a model at all unless you can explain why it made a specific decision. Token-level attribution that traces back to training data sources could satisfy audit requirements that currently block LLM adoption entirely. Curious how the performance compares to a standard Llama 8B on benchmarks - interpretability usually comes with a quality tax.
brendanashworth 2 hours ago
Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar. | ||||||||||||||||||||||||||
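For context on what SHAP-style attribution means for tokens: the idea is to treat each input token as a "player" and credit it with its average marginal contribution to the model's output score. Below is a minimal, self-contained sketch of that exact Shapley computation in pure Python, using a toy scoring function as a stand-in for a real model's log-probability (the SHAP library itself uses sampling-based approximations, since the exact sum is exponential in the number of tokens; the token list and scorer here are illustrative assumptions, not anything from the linked model).

```python
from itertools import combinations
from math import factorial

def shapley_token_attribution(tokens, score):
    """Exact Shapley values: each token's average marginal contribution
    to score() over all subsets of the other tokens.

    score: maps a tuple of tokens (a subset, order preserved) to a number,
    e.g. a model's score for some output. Exponential in len(tokens);
    libraries like SHAP approximate this by sampling subsets.
    """
    n = len(tokens)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = tuple(tokens[j] for j in sorted(subset + (i,)))
                without_i = tuple(tokens[j] for j in subset)
                values[i] += weight * (score(with_i) - score(without_i))
    return values

# Toy scorer (assumption): counts positive words, standing in for a model.
POSITIVE = {"great", "good"}
score = lambda toks: sum(t in POSITIVE for t in toks)
vals = shapley_token_attribution(["the", "movie", "was", "great"], score)
print(vals)  # attribution concentrates on "great"
```

One nice property this preserves: the values sum to score(all tokens) minus score(empty), so the explanation fully accounts for the output.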
| ||||||||||||||||||||||||||
pbmango 2 hours ago
This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it is either solved, or too far out of reach to bother stopping and thinking about.
great_psy 2 hours ago
Maybe I’m not creative enough to see the potential, but what value does this bring? Given the example I saw about CRISPR, what does this model's output give over a different, non-explaining model? Does it really make me more confident in the output if I know the data came from arXiv or Wikipedia? I find LLM outputs are subtly wrong, not obviously wrong.
| ||||||||||||||||||||||||||
rvz 2 hours ago
Now this is something very interesting to see, and it might be the answer to the explainability issue with LLMs, which could unlock many more use cases that are currently off limits. We'll see.