namaria 12 hours ago

Oy vey, not this paper again.

"Our methods study the model indirectly using a more interpretable “replacement model,” which incompletely and imperfectly captures the original."

"(...) we build a replacement model that approximately reproduces the activations of the original model using more interpretable components. Our replacement model is based on a cross-layer transcoder (CLT) architecture (...)"

https://transformer-circuits.pub/2025/attribution-graphs/bio...

"Remarkably, we can substitute our learned CLT features for the model's MLPs while matching the underlying model's outputs in ~50% of cases."

"Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model’s activations, there is no guarantee that it does so via the same mechanisms."

https://transformer-circuits.pub/2025/attribution-graphs/met...

These two papers were designed to be used for exactly the sort of argument you're making. You point to a blog post that glosses over all of this. You have to click through "Read the paper" to reach a ~100-page paper, which references another ~100-page paper, before you find any of these caveats. The blog post you linked doesn't even contain the words "replacement (model)" or any discussion of the reliability of this approach. Yet it is happy to make bold claims such as "we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors", which is simply not true.

Sure, they added to the blog post: "the mechanisms we do see may have some artifacts based on our tools which don't reflect what is going on in the underlying model". But that is a lot of indirection when the fact is that every observation discussed in the papers and the blog posts is an observation of the replacement model, i.e. of exactly such artifacts.
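To make the objection concrete, here is a minimal sketch of what "trained to mimic the activations of the underlying model" means in practice. This is my own toy PyTorch code, not Anthropic's, and it is per-layer rather than cross-layer; the dimensions, learning rate, and sparsity penalty are made up. The point is that the loss only rewards reconstructing the MLP's outputs (plus feature sparsity); nothing constrains the transcoder to compute them via the same mechanism as the MLP:

  import torch
  import torch.nn as nn

  d_model, d_features = 512, 4096  # made-up sizes

  class Transcoder(nn.Module):
      def __init__(self):
          super().__init__()
          self.enc = nn.Linear(d_model, d_features)
          self.dec = nn.Linear(d_features, d_model)

      def forward(self, x):
          feats = torch.relu(self.enc(x))  # wide, sparse "feature" layer
          return self.dec(feats), feats

  # mlp_in / mlp_out: activations recorded around one MLP of the real
  # model; random stand-ins here.
  mlp_in = torch.randn(1024, d_model)
  mlp_out = torch.randn(1024, d_model)

  tc = Transcoder()
  opt = torch.optim.Adam(tc.parameters(), lr=1e-4)
  for step in range(1000):
      opt.zero_grad()
      recon, feats = tc(mlp_in)
      # Reconstruction error plus an L1 sparsity penalty: the objective
      # scores what comes out, never how it is computed.
      loss = ((recon - mlp_out) ** 2).mean() + 1e-3 * feats.abs().mean()
      loss.backward()
      opt.step()

Any circuits you then read off are circuits of this stand-in, which is the whole problem.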
Oy vey not this paper again. "Our methods study the model indirectly using a more interpretable “replacement model,” which incompletely and imperfectly captures the original." "(...) we build a replacement model that approximately reproduces the activations of the original model using more interpretable components. Our replacement model is based on a cross-layer transcoder (CLT) architecture (...)" https://transformer-circuits.pub/2025/attribution-graphs/bio... "Remarkably, we can substitute our learned CLT features for the model's MLPs while matching the underlying model's outputs in ~50% of cases." "Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model’s activations, there is no guarantee that it does so via the same mechanisms." https://transformer-circuits.pub/2025/attribution-graphs/met... These two papers were designed to be used as the sort of argument that you're making. You point to a blog post that glazes over it. You have to click through the "Read the paper" to find a ~100 page paper, referencing another ~100 page paper to find any of these caveats. The blog post you linked doesn't even feature the words "replacement (model)" or any discussion of the reliability of this approach. Yet it is happy to make bold claims such as "we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors" which is simply not true. Sure, they added to the blog post: "the mechanisms we do see may have some artifacts based on our tools which don't reflect what is going on in the underlying model" but that seems like a lot of indirection when the fact is that all observations commented in the papers and the blog posts are about nothing but such artifacts. |